Transformed Regression Type Estimators in the Presence of Missing
Observations: Case Studies on COVID-19 Incidence in Chiang Mai,
Thailand
NATTHAPAT THONGSAK1, NUANPAN LAWSON2*
1State Audit Office of the Kingdom of Thailand,
Bangkok, 10400,
THAILAND
2Department of Applied Statistics, Faculty of Applied Science,
King Mongkut’s University of Technology North Bangkok,
1518 Pracharat 1 Road, Wongsawang, Bangsue, Bangkok 10800,
THAILAND
Abstract: - The transformation technique can be used to modify the shape of a variable in order to improve the
performance of the population mean estimator. When data are missing, they must be handled before the
population mean can be estimated with standard statistical methods. In this study, we
focus on new transformed regression type estimators when missing data are present in the study variable under
the uniform nonresponse mechanism, assuming that the population mean of the auxiliary variable is
unavailable, as is usually the case in practice. An auxiliary variable can increase the efficiency of
estimating the population mean. The bias and mean square error are derived up to the first order of
approximation using the Taylor series. A simulation and case studies on COVID-19 incidence in Chiang Mai,
Thailand are used to assess the performance of the new transformed estimators. The estimated number of
COVID-19 patients who have pneumonia and require high-flow oxygen and the estimated daily confirmed
cases of COVID-19 in Chiang Mai from the best proposed estimator are around 17 cases and 118 cases,
respectively.
Key-Words: - Transformed variable, missing data, COVID-19, uniform nonresponse, fine particulate matter,
mean imputation, ratio imputation, bias, mean square error
Received: November 2, 2023. Revised: December 13, 2023. Accepted: January 22, 2024. Published: March 21, 2024.
1 Introduction
Changing the shape of the variables can be done by
using the transformation method. The well-known
transformed variable was suggested by [1], who
introduced changing an auxiliary variable for
estimating the population mean using the dual-to-
ratio estimator under simple random sampling
without replacement (SRSWOR). The transformed
auxiliary variable in [1], is:
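$$x_i^{*} = (1 + \pi_{SRS})\bar{X} - \pi_{SRS}\,x_i;\quad i = 1, 2, 3, \ldots, N, \tag{1}$$
and the corresponding sample mean is
$$\bar{x}^{*}_{SRS} = (1 + \pi_{SRS})\bar{X} - \pi_{SRS}\,\bar{x}, \tag{2}$$
where $\bar{X} = \sum_{i=1}^{N} x_i / N$ and $\bar{x} = \sum_{i=1}^{n} x_i / n$ are the
population mean and sample mean of $X$, respectively,
$\pi_{SRS} = n/(N-n)$, $n$ is the sample size,
and $N$ is the population size.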
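As a quick numerical illustration of (1) and (2), the sketch below applies the transformation to a simulated population. The function name `dual_transform`, the variable `pi_srs` (standing for $n/(N-n)$), and the uniform test data are assumptions made for this example and do not come from [1]. A useful check is that the mean of the transformed sample coincides with the mean of the $N-n$ units that were not sampled.

```python
import random

def dual_transform(x_sample, pop_mean_x, pop_size):
    """Apply x*_i = (1 + pi) * Xbar - pi * x_i with pi = n / (N - n)."""
    n = len(x_sample)
    pi_srs = n / (pop_size - n)
    return [(1 + pi_srs) * pop_mean_x - pi_srs * x for x in x_sample]

random.seed(1)
population = [random.uniform(10, 50) for _ in range(200)]  # hypothetical auxiliary values
N = len(population)
X_bar = sum(population) / N

sampled_idx = set(random.sample(range(N), 40))  # SRSWOR of size n = 40
sample = [population[i] for i in sampled_idx]
unsampled = [population[i] for i in range(N) if i not in sampled_idx]

x_star = dual_transform(sample, X_bar, N)
x_star_bar = sum(x_star) / len(x_star)
unsampled_mean = sum(unsampled) / len(unsampled)
print(x_star_bar, unsampled_mean)  # the two means agree up to rounding error
```

The agreement follows from $1 + \pi_{SRS} = N/(N-n)$, so the transformed sample mean equals $(N\bar{X} - n\bar{x})/(N-n)$.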
There is a plethora of work related to
transformed variables. For example, a linear
transformation of the study variable was suggested
to improve the ratio estimator, [2]. Based on the
work of [1], the authors of [3] suggested using the
transformation technique in [1] to improve the dual
to ratio estimators under SRSWOR, assuming that
some known auxiliary parameters are available, [4],
[5].
Missing data is a prominent issue occurring in
sample surveys. Ignoring the missing data may lead
to bias and high variance. Imputation methods have
assisted in dealing with missing data, [6], [7], [8],
[9]. The mean imputation method is applied by
replacing the missing data with a sample mean of
the available information. The point estimator of the
mean imputation technique is:
Natthapat Thongsak, Nuanpan Lawson
E-ISSN: 2224-2902
131
Volume 21, 2024
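$$\hat{\bar{Y}}_{S} = \bar{y}_r, \tag{3}$$
where $\bar{y}_r = \frac{1}{r}\sum_{i=1}^{r} y_i$ is the sample mean of the study
variable $Y$ over the $r$ responding units.
The bias of $\hat{\bar{Y}}_{S}$ is
$$Bias(\hat{\bar{Y}}_{S}) = 0. \tag{4}$$
The variance of $\hat{\bar{Y}}_{S}$ is
$$V(\hat{\bar{Y}}_{S}) = \left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2, \tag{5}$$
where $C_y = S_y/\bar{Y}$ and $S_y^2 = \sum_{i=1}^{N}(y_i - \bar{Y})^2/(N-1)$.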
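The mean imputation estimator (3) and its variance (5) can be sketched as follows. The function names and all numerical inputs are hypothetical values chosen for illustration; they are not taken from the study data.

```python
def mean_imputation_estimate(y_respondents):
    """Equation (3): the mean of the observed study-variable values."""
    return sum(y_respondents) / len(y_respondents)

def mean_imputation_variance(r, N, Y_bar, C_y):
    """Equation (5): (1/r - 1/N) * Ybar^2 * C_y^2."""
    return (1 / r - 1 / N) * Y_bar ** 2 * C_y ** 2

y_observed = [12, 20, 15, 9, 24]  # hypothetical responses (r = 5)
estimate = mean_imputation_estimate(y_observed)

# Hypothetical design: r = 30 respondents, N = 500, Ybar = 100, C_y = 0.5
variance = mean_imputation_variance(r=30, N=500, Y_bar=100.0, C_y=0.5)
print(estimate, variance)
```

Because the variance depends on the number of respondents $r$ rather than on $n$, every missing value inflates it directly.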
Ratio imputation is another popular
imputation method, used when there is a relationship
between an auxiliary variable and the study variable. The point
estimator for the ratio imputation method is:
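$$\hat{\bar{Y}}_{Rat} = \bar{y}_r\,\frac{\bar{x}_n}{\bar{x}_r}, \tag{6}$$
where $\bar{x}_n = \sum_{i=1}^{n} x_i / n$ and $\bar{x}_r = \sum_{i=1}^{r} x_i / r$.
The bias of $\hat{\bar{Y}}_{Rat}$ is:
$$Bias(\hat{\bar{Y}}_{Rat}) = \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}\left(C_x^2 - \rho C_x C_y\right), \tag{7}$$
and the MSE of $\hat{\bar{Y}}_{Rat}$ is
$$MSE(\hat{\bar{Y}}_{Rat}) = \left(\frac{1}{n} - \frac{1}{N}\right)\bar{Y}^2 C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left(C_y^2 + C_x^2 - 2\rho C_x C_y\right), \tag{8}$$
where $\bar{Y} = \sum_{i=1}^{N} y_i / N$, $C_x = S_x/\bar{X}$,
$S_x^2 = \sum_{i=1}^{N}(x_i - \bar{X})^2/(N-1)$,
and $\rho = S_{xy}/(S_x S_y)$.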
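A minimal sketch of the ratio imputation estimator (6) together with its first-order bias (7) and MSE (8) is given below. The function names and the toy data are assumptions for illustration only. In the toy data the study variable is exactly twice the auxiliary variable, so the estimator should return twice the full-sample mean of $x$.

```python
def ratio_imputation_estimate(y_resp, x_resp, x_sample):
    """Equation (6): y-bar_r * x-bar_n / x-bar_r."""
    y_r = sum(y_resp) / len(y_resp)
    x_r = sum(x_resp) / len(x_resp)
    x_n = sum(x_sample) / len(x_sample)
    return y_r * x_n / x_r

def ratio_imputation_bias(r, n, Y_bar, C_x, C_y, rho):
    """Equation (7): first-order bias."""
    return (1 / r - 1 / n) * Y_bar * (C_x ** 2 - rho * C_x * C_y)

def ratio_imputation_mse(r, n, N, Y_bar, C_x, C_y, rho):
    """Equation (8): sampling part plus nonresponse part."""
    sampling = (1 / n - 1 / N) * Y_bar ** 2 * C_y ** 2
    nonresponse = (1 / r - 1 / n) * Y_bar ** 2 * (
        C_y ** 2 + C_x ** 2 - 2 * rho * C_x * C_y
    )
    return sampling + nonresponse

# Hypothetical sample: x is known for all n = 6 units, y only for the first r = 4.
x_sample = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
x_resp = x_sample[:4]
y_resp = [2 * x for x in x_resp]
est = ratio_imputation_estimate(y_resp, x_resp, x_sample)
print(est)
```

With complete response ($r = n$) the bias term vanishes and the MSE reduces to the usual SRSWOR variance of the sample mean.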
The COVID-19 pandemic has afflicted the
entire world devastatingly in an abundance of areas
from way of life and the economy to global
healthcare. In terms of health, the impact of the
virus on humans goes much further than many
anticipate, as it not only presents with acute
respiratory infection symptoms but may continue to
linger on in the body in the form of organ damage.
Severe infection can permanently alter the immune
system. Efforts to produce vaccines and reduce
incidence through isolation measures to mitigate the
repercussions of the pandemic have been successful.
However, the virus has continued to spread for
years and can severely affect the more susceptible population,
so researchers are delving deeper into the
issue and its etiologies to stop the virus. There are
innumerable risk factors for severe hospitalization
from COVID-19; one interesting aspect is the
influence of pollution. Thailand is one of the
countries with pollution from fine particulate matter
as a prevailing problem for many years. It affects
areas all over the country and is mostly caused by
burning agricultural waste every year. Research has
found a correlation between exposure to fine
particulate matter and COVID-19 hospitalization
and incidence rates. The pollution can permanently
damage the body’s immune system, increasing the
risk of viral infection and severe symptoms. Fine
particulate matter exceeding safe levels in Thailand,
especially Chiang Mai, is an obstinate concern that
increases the risk for a myriad of non-
communicable diseases that are the leading causes
of death. Examining these levels is critical for the
prevention of further consequences of pollution
through national policies and measures, albeit data
regarding fine particulate matter are often missing.
Many researchers have investigated the connection
between COVID-19 and air pollution data. The
studies indicate that there is a positive correlation
between COVID-19 and air pollution data such as
PM2.5, [10], [11], [12], [13], [14]. Missing data
occur in many real datasets, including COVID-19 and
fine particulate matter data, and as a result proper
statistical techniques should be applied to deal with
these data.
In this study, a class of regression type
estimators utilizing the transformation of an
auxiliary variable is proposed under simple
random sampling without replacement (SRSWOR). The uniform
nonresponse mechanism is considered in this study
and it is assumed that the population mean of the
auxiliary variable is unknown. This study
investigates the bias and mean square error of the
proposed estimators. A simulation study and
applications to COVID-19 incidence in Chiang Mai,
Thailand are studied using the proposed transformed
estimators.
2 Proposed Estimator
Inspired by [3], assuming that the study variable is
missing under the uniform nonresponse mechanism
and the population mean of an auxiliary variable is
unknown, a class of regression type estimators for
estimating population mean utilizing the
transformation of an auxiliary variable is proposed
as below.
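$$\hat{\bar{Y}}_{N} = \left[\bar{y}_r + b\left(\bar{x}_n - \bar{x}^{*}\right)\right]\left(\frac{A\bar{x}^{*} + D}{A\bar{x}_n + D}\right), \tag{7}$$
where
$$\bar{x}^{*} = \frac{n\bar{x}_n - r\bar{x}_r}{n - r} = (1 + \pi)\bar{x}_n - \pi\bar{x}_r,\quad \pi = \frac{r}{n - r},$$
$b = s_{xy}/s_x^2$ is the sample regression coefficient, and
$A \neq 0$, $D$ are real numbers or functions of the
auxiliary variable.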
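The sketch below shows how an estimator of this class could be computed. It rests on two assumptions that should be checked against the typeset equation: the estimator is read as $[\bar{y}_r + b(\bar{x}_n - \bar{x}^{*})](A\bar{x}^{*} + D)/(A\bar{x}_n + D)$ with $\bar{x}^{*} = (1+\pi)\bar{x}_n - \pi\bar{x}_r$ and $\pi = r/(n-r)$, and $b$ is computed from the $r$ responding pairs. The function names, the toy data, and the choice $A = 1$, $D = 2$ are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)

def transformed_mean(x_resp, x_sample):
    """x-bar* = (1 + pi) * x-bar_n - pi * x-bar_r with pi = r / (n - r)."""
    n, r = len(x_sample), len(x_resp)
    pi = r / (n - r)
    return (1 + pi) * _mean(x_sample) - pi * _mean(x_resp)

def proposed_estimate(y_resp, x_resp, x_sample, A=1.0, D=0.0):
    """Assumed form: [y_r + b (x_n - x*)] * (A x* + D) / (A x_n + D)."""
    y_r, x_r, x_n = _mean(y_resp), _mean(x_resp), _mean(x_sample)
    r = len(x_resp)
    # Assumption: b = s_xy / s_x^2 is computed from the r responding pairs.
    s_xy = sum((x - x_r) * (y - y_r) for x, y in zip(x_resp, y_resp)) / (r - 1)
    s_x2 = sum((x - x_r) ** 2 for x in x_resp) / (r - 1)
    b = s_xy / s_x2
    x_star = transformed_mean(x_resp, x_sample)
    return (y_r + b * (x_n - x_star)) * (A * x_star + D) / (A * x_n + D)

# Hypothetical sample of n = 6 units; y is observed for the first r = 4.
x_sample = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
x_resp = x_sample[:4]
y_resp = [21.0, 25.0, 27.0, 33.0]
x_star = transformed_mean(x_resp, x_sample)  # mean of x over the 2 nonrespondents
estimate = proposed_estimate(y_resp, x_resp, x_sample, A=1.0, D=2.0)
print(x_star, estimate)
```

Since $\bar{x}^{*} = (n\bar{x}_n - r\bar{x}_r)/(n-r)$, the transformed mean is the average of the auxiliary variable over the nonresponding units, so no population mean of $X$ is required.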
The following notations are used to obtain the
bias and MSE of the proposed estimator.
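$$\varepsilon_0 = \frac{\bar{y}_r - \bar{Y}}{\bar{Y}},\quad \bar{y}_r = \bar{Y}(1 + \varepsilon_0),$$
$$\varepsilon_1 = \frac{\bar{x}_r - \bar{X}}{\bar{X}},\quad \bar{x}_r = \bar{X}(1 + \varepsilon_1),$$
$$\varepsilon_2 = \frac{\bar{x}_n - \bar{X}}{\bar{X}},\quad \bar{x}_n = \bar{X}(1 + \varepsilon_2),$$
$$E(\varepsilon_0) = E(\varepsilon_1) = E(\varepsilon_2) = 0,$$
$$E(\varepsilon_0^2) = \left(\frac{1}{r} - \frac{1}{N}\right)C_y^2,\quad E(\varepsilon_1^2) = \left(\frac{1}{r} - \frac{1}{N}\right)C_x^2,\quad E(\varepsilon_2^2) = \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2,$$
$$E(\varepsilon_0\varepsilon_1) = \left(\frac{1}{r} - \frac{1}{N}\right)\rho C_x C_y,\quad E(\varepsilon_0\varepsilon_2) = \left(\frac{1}{n} - \frac{1}{N}\right)\rho C_x C_y,$$
$$E(\varepsilon_1\varepsilon_2) = \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2,\quad K = \frac{\bar{X}}{\bar{Y}}.$$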
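The second moments of the relative errors of $\bar{x}_r$ and $\bar{x}_n$ can be verified by exact enumeration on a tiny population. The sketch assumes a fixed number of respondents $r$ chosen uniformly from the sample, which is one way to realize the uniform nonresponse mechanism; the population values and variable names are hypothetical.

```python
from itertools import combinations

x_pop = [3.0, 5.0, 8.0, 9.0, 15.0]  # hypothetical tiny population
N, n, r = len(x_pop), 3, 2
X_bar = sum(x_pop) / N
S2_x = sum((x - X_bar) ** 2 for x in x_pop) / (N - 1)
C2_x = S2_x / X_bar ** 2  # squared coefficient of variation of X

def mean(values):
    return sum(values) / len(values)

e1_sq = e2_sq = e1_e2 = 0.0
count = 0
# Every (sample, respondent subset) pair is equally likely under SRSWOR
# followed by uniform nonresponse with a fixed number of respondents.
for sample in combinations(x_pop, n):
    for resp in combinations(sample, r):
        e1 = (mean(resp) - X_bar) / X_bar    # relative error of x-bar_r
        e2 = (mean(sample) - X_bar) / X_bar  # relative error of x-bar_n
        e1_sq += e1 * e1
        e2_sq += e2 * e2
        e1_e2 += e1 * e2
        count += 1
e1_sq, e2_sq, e1_e2 = e1_sq / count, e2_sq / count, e1_e2 / count

print(e1_sq, (1 / r - 1 / N) * C2_x)  # second moment of the x-bar_r error
print(e2_sq, (1 / n - 1 / N) * C2_x)  # second moment of the x-bar_n error
print(e1_e2, (1 / n - 1 / N) * C2_x)  # cross moment of the two errors
```

The cross moment equals the second moment of the sample-mean error because the respondent mean is conditionally unbiased for the sample mean.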
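Rewriting $\hat{\bar{Y}}_{N}$ in terms of the $\varepsilon_i$'s, $i = 0, 1, 2$, we
have:
$$\hat{\bar{Y}}_{N} = \left[\bar{y}_r + b\left(\bar{x}_n - \bar{x}^{*}\right)\right]\left(\frac{A\bar{x}^{*} + D}{A\bar{x}_n + D}\right)$$
$$= \left[\bar{Y}(1 + \varepsilon_0) + b\pi\bar{X}(\varepsilon_1 - \varepsilon_2)\right]\left(\frac{A\bar{X} + D + A\bar{X}(\varepsilon_2 + \pi\varepsilon_2 - \pi\varepsilon_1)}{A\bar{X} + D + A\bar{X}\varepsilon_2}\right).$$
Letting $\theta = \frac{A\bar{X}}{A\bar{X} + D}$, we get
$$\hat{\bar{Y}}_{N} = \bar{Y}\left[1 + \varepsilon_0 + bK\pi\varepsilon_1 - bK\pi\varepsilon_2\right]\left[1 + \theta\varepsilon_2 + \theta\pi\varepsilon_2 - \theta\pi\varepsilon_1\right]\left[1 + \theta\varepsilon_2\right]^{-1}.$$
Using the Taylor series approximation, we get:
$$\hat{\bar{Y}}_{N} \approx \bar{Y}\left[1 + \varepsilon_0 + bK\pi\varepsilon_1 - bK\pi\varepsilon_2\right]\left[1 + \theta\varepsilon_2 + \theta\pi\varepsilon_2 - \theta\pi\varepsilon_1\right]\left[1 - \theta\varepsilon_2 + \theta^2\varepsilon_2^2\right].$$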
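Then the approximate bias of $\hat{\bar{Y}}_{N}$, with $b$ replaced by the population regression coefficient $\beta = S_{xy}/S_x^2$, is:
$$Bias(\hat{\bar{Y}}_{N}) = E(\hat{\bar{Y}}_{N} - \bar{Y})$$
$$\approx \bar{Y}E\left[\varepsilon_0 + (\beta K - \theta)\pi(\varepsilon_1 - \varepsilon_2) - \theta\pi\varepsilon_0(\varepsilon_1 - \varepsilon_2) - \beta K\theta\pi^2(\varepsilon_1 - \varepsilon_2)^2 + \theta^2\pi\varepsilon_2(\varepsilon_1 - \varepsilon_2)\right]$$
$$= -\left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}\,\theta\pi\left(\pi\beta K C_x^2 + \rho C_x C_y\right). \tag{8}$$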
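Under the assumption that terms
involving powers higher than two are negligibly
small, the mean square error of $\hat{\bar{Y}}_{N}$ is:
$$MSE(\hat{\bar{Y}}_{N}) \approx E\left[\bar{Y}\left(\varepsilon_0 - \pi(\theta - bK)\varepsilon_1 + \pi(\theta - bK)\varepsilon_2\right)\right]^2$$
$$= \left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left[\pi^2(\theta - \beta K)^2 C_x^2 - 2\pi(\theta - \beta K)\rho C_x C_y\right]. \tag{9}$$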
Some proposed estimators are shown in Table 1.
Table 1. Some members of the proposed estimator
where $\beta_1$ and $\beta_2$ are the coefficients of skewness
and kurtosis of the auxiliary variable, respectively,
$Q_1$, $Q_2$, and $Q_3$ are the first, second, and third
quartiles of the auxiliary variable, respectively, and
$Q_a = (Q_3 + Q_1)/2$ and $Q_d = (Q_3 - Q_1)/2$ are the
quartile mean and quartile deviation of the auxiliary
variable, respectively.
3 Efficiency Comparison
The efficiency of the proposed estimator ($\hat{\bar{Y}}_{N}$) is
compared with that of the existing estimators, namely the mean
imputation estimator ($\hat{\bar{Y}}_{S}$) and the ratio imputation
estimator ($\hat{\bar{Y}}_{Rat}$), based on the MSE.
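1) $\hat{\bar{Y}}_{N}$ performs better than $\hat{\bar{Y}}_{S}$ if:
$$MSE(\hat{\bar{Y}}_{N}) < MSE(\hat{\bar{Y}}_{S})$$
$$\left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left[\pi^2(\theta - \beta K)^2 C_x^2 - 2\pi(\theta - \beta K)\rho C_x C_y\right] < \left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2$$
$$\pi^2(\theta - \beta K)^2 C_x^2 - 2\pi(\theta - \beta K)\rho C_x C_y < 0$$
which, for $\pi(\theta - \beta K) > 0$, gives
$$\rho > \frac{\pi(\theta - \beta K)C_x}{2C_y}. \tag{10}$$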
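2) $\hat{\bar{Y}}_{N}$ performs better than $\hat{\bar{Y}}_{Rat}$ if
$$MSE(\hat{\bar{Y}}_{N}) < MSE(\hat{\bar{Y}}_{Rat})$$
$$\left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left[\pi^2(\theta - \beta K)^2 C_x^2 - 2\pi(\theta - \beta K)\rho C_x C_y\right]$$
$$< \left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2 C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left(C_x^2 - 2\rho C_x C_y\right)$$
$$\pi^2(\theta - \beta K)^2 C_x^2 - 2\pi(\theta - \beta K)\rho C_x C_y < C_x^2 - 2\rho C_x C_y$$
which, for $\pi(\theta - \beta K) > 1$, gives
$$\rho > \frac{\left[\pi(\theta - \beta K) + 1\right]C_x}{2C_y}, \tag{11}$$
with the inequality reversed when $\pi(\theta - \beta K) < 1$.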
4 A Simulation Study
To assess the performance of the proposed
estimators, data were generated from a bivariate
normal distribution with the following parameters:
$$N = 4{,}000,\ \bar{X} = 150,\ \bar{Y} = 280,\ C_x = 2.2,\ C_y = 1.1,\ \rho = 0.8.$$
Then 40% of the study variable values were
randomly assigned as missing, and we randomly
selected a sample of 25% of the units from the population of
size $N = 4{,}000$ using the SRSWOR scheme.
The biases and MSEs of the proposed and
existing estimators are presented in Table 2, where
$$Bias(\hat{\bar{Y}}) = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000}\left(\hat{\bar{Y}}_i - \bar{Y}\right), \tag{12}$$
$$MSE(\hat{\bar{Y}}) = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000}\left(\hat{\bar{Y}}_i - \bar{Y}\right)^2. \tag{13}$$
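The Monte Carlo criteria (12) and (13) can be sketched for the two existing estimators as follows. This is a reduced illustration, not the study's simulation: the population size, sample size, number of replications, and the coefficients of variation (0.2 and 0.3) are assumptions chosen to keep the run short and the ratio estimator stable, and nonresponse is imposed by dropping a fixed 40% of each sample at random. The function name `simulate` is hypothetical.

```python
import random

def simulate(reps=2000, N=400, n=100, miss_rate=0.4, seed=2024):
    """Monte Carlo bias and MSE of mean and ratio imputation, as in (12)-(13)."""
    rng = random.Random(seed)
    # Hypothetical population: means and correlation as in the text,
    # but smaller coefficients of variation (an assumption of this sketch).
    mu_x, mu_y, cv_x, cv_y, rho = 150.0, 280.0, 0.2, 0.3, 0.8
    sd_x, sd_y = mu_x * cv_x, mu_y * cv_y
    pop = []
    for _ in range(N):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mu_x + sd_x * z1
        y = mu_y + sd_y * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        pop.append((x, y))
    Y_bar = sum(y for _, y in pop) / N

    r = round(n * (1 - miss_rate))  # number of respondents per sample
    err_sum = {"mean": 0.0, "ratio": 0.0}
    sq_sum = {"mean": 0.0, "ratio": 0.0}
    for _ in range(reps):
        sample = rng.sample(pop, n)    # SRSWOR
        resp = rng.sample(sample, r)   # uniform nonresponse
        x_n = sum(x for x, _ in sample) / n
        x_r = sum(x for x, _ in resp) / r
        y_r = sum(y for _, y in resp) / r
        estimates = {"mean": y_r, "ratio": y_r * x_n / x_r}
        for name, est in estimates.items():
            err_sum[name] += est - Y_bar
            sq_sum[name] += (est - Y_bar) ** 2
    return {name: (err_sum[name] / reps, sq_sum[name] / reps) for name in err_sum}

results = simulate()
for name, (bias, mse) in results.items():
    print(name, bias, mse)
```

Other estimators can be compared by adding an entry to the `estimates` dictionary inside the replication loop.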
Table 2. Biases and MSEs of the estimators
According to Table 2, the proposed estimators
outperformed the existing estimators. Both
the bias and MSE of the proposed estimators are
smaller than those of the mean and ratio imputation
methods. The best estimator is $\hat{\bar{Y}}_{N2}$, which utilizes the
coefficient of kurtosis of the auxiliary variable
and gave the smallest bias and MSE.
5 Applications to COVID-19 Incidence
In this section, the COVID-19 dataset collected
from Chiang Mai province, [15], and daily PM2.5
concentrations, [16], between 1 April 2022 and 31
July 2022 (population size $N = 122$) were used
to illustrate the efficiency of the proposed
estimators.
Population I: we assigned the number of
COVID-19 patients who have pneumonia and
require high-flow oxygen as the study variable, and
the PM2.5 concentration (micrograms per cubic
meter) as the auxiliary variable. The population
parameters are as follows:
$$N = 122,\ \bar{X} = 18.16,\ \bar{Y} = 17.11,\ C_x = 0.85,\ C_y = 0.60,\ \rho = 0.62.$$
Population II: we assigned the daily confirmed
cases of COVID-19 as the study variable, and the
PM2.5 concentration (micrograms per cubic meter)
as the auxiliary variable. The population parameters
are described as follows:
$$N = 122,\ \bar{X} = 18.16,\ \bar{Y} = 117.88,\ C_x = 0.85,\ C_y = 1.19,\ \rho = 0.79.$$
A sample of size $n = 36$ is drawn from the
population of size $N = 122$ using SRSWOR, with
30% and 25% of the study variable missing for
populations I and II, respectively.
Figure 1 shows the scatter plot between PM2.5
concentration and the number of COVID-19 patients
who had pneumonia and required high-flow oxygen.
Figure 2 shows the scatter plot between PM2.5
concentration and the daily confirmed cases of
COVID-19. The estimated number of COVID-19
patients who have pneumonia and require high-flow
oxygen, the estimated daily confirmed cases of COVID-19,
and the percentage relative efficiencies (PREs) of the estimators were calculated
using the R program, [17], and are presented in
Table 3.
Fig. 1: The scatter plot between PM2.5
concentration and the number of COVID-19 patients
who had pneumonia and required high-flow oxygen
Fig. 2: The scatter plot between PM2.5
concentration and the daily confirmed cases of
COVID-19
Table 3. Estimated number of COVID-19 patients
who have pneumonia and require high-flow oxygen,
estimated daily confirmed cases of COVID-19, and
PREs of the estimators
Figure 1 and Figure 2 indicate that both the
number of COVID-19 patients who have pneumonia
and require high-flow oxygen and the daily confirmed
cases of COVID-19 are positively related to
PM2.5 concentration, with correlation coefficients
of 0.62 and 0.79, respectively.
Table 3 indicates the proposed estimators gave
better results in terms of PREs in comparison to the
mean imputation method. The proposed estimators
produced both the estimated number of COVID-19
patients who have pneumonia and require high-flow
oxygen and the estimated daily confirmed cases of
COVID-19 closer to the population mean than the other
estimators, and the PREs of the proposed estimators
are higher than those of the mean and ratio imputation
estimators. The best performers are $\hat{\bar{Y}}_{N3}$, which utilizes the
coefficient of variation of the auxiliary variable, for population I and
$\hat{\bar{Y}}_{N2}$, which utilizes the
coefficient of kurtosis of the auxiliary variable, for population II.
The estimated number of COVID-19
patients who have pneumonia and require high-flow
oxygen and the estimated daily confirmed cases of
COVID-19 in Chiang Mai from the best proposed
estimator are around 17 cases and 118 cases,
respectively.
6 Conclusion
Transformed estimators have been introduced in the
presence of missing data with SRSWOR to improve
the performance of the population mean estimator.
Transforming the auxiliary variable changes its shape, which
improves the performance of the population mean
estimator even when the auxiliary variable's
population mean is not known, as is usually the case
in practice. The bias and mean square error of the
transformed estimators were derived. The results
show that the new transformed estimators gave
the smallest bias and mean square error among the estimators compared
and gave estimates of COVID-19 incidence closer
to the population values. In particular, the
estimators using the coefficient of variation and the
coefficient of kurtosis of the auxiliary variable gave
the highest PREs
relative to the other estimators. For future work, the
suggested estimators may be applicable to assist
with other survey designs e.g. stratified random
sampling, double sampling, and cluster sampling,
and in more flexible nonresponse mechanisms.
Moreover, the estimators can be extended to cover
the case that the missing data appears in the
auxiliary variable or both study and auxiliary
variables. The proposed estimators are very useful
in practice in estimating the variable of interest in
real data when nonresponse occurs in the study.
Acknowledgement:
We are thankful for all the helpful comments from
the anonymous referees, which improved the paper.
References:
[1] Srivenkataramana, T., A dual to ratio
estimator in sample surveys, Biometrika,
Vol. 67, No. 1, 1980, pp.199-204.
[2] Singh, A., Mourya, K.K. and Sisodia, B.V.S.,
Comparison of some ratio estimators using
linear transformation, International Journal of
Current Microbiology and Applied Sciences,
Special issue 9, 2019, pp. 57-67.
[3] Thongsak, N. and Lawson, N., Classes of
dual to modified ratio estimators for
estimating population mean in simple
random sampling, Proceedings of the 2021
Research, Invention and Innovation
Congress, Bangkok, Thailand, September, 1-
2, 2021, pp. 211-215.
[4] Upadhyaya, L.N. and Singh, H.P., Use of
transformed auxiliary variable in estimating
the finite population mean, Biometrical
Journal, Vol. 41, 1999, pp. 627–636.
[5] Onyeka, A.C., Nlebedim, V.U. and Izunobi,
C.H., A Class of estimators for population
ratio in simple random sampling using
variable transformation, Open Journal of
Statistics, Vol.4, 2014, pp.284-291.
[6] Nangsue, N., Adjusted ratio and regression
type estimators for estimation of population
mean when some observations are missing,
International Scholarly and Scientific
Research & Innovation, Vol. 3, No. 5, 2009,
pp. 334-337.
[7] Singh, S., and Horn, S., Compromised
imputation in survey sampling, Metrika,
Vol.51, No. 3, 2000, pp. 267–276.
[8] Singh, S. and Deo, B., Imputation by power
transformation, Statistical Papers, Vol. 44,
2003, pp. 555–579.
[9] Singh, A. K., Singh, P., and Singh, V.,
Exponential-type compromised imputation in
survey sampling, Journal of Statistics
Applications & Probability, Vol. 3, No. 2,
2014, pp. 211-217.
[10] Jiang, Y., Wu, X.-J., Guan, Y.-J., Effect of
ambient air pollutants and
meteorological variables on COVID-19
incidence, Infection Control & Hospital
Epidemiology, Vol. 41, No. 9, 2020,
pp. 1011–1015.
[11] Chan, S., Chu, J., Zhang, Y., Nadarajah, S.,
Count regression models for COVID-19,
Physica A, Vol. 563, 2019, 125460.
[12] Martelletti, L., Martelletti, P., Air pollution
and the novel Covid-19 disease: a
putative disease risk factor, SN
Comprehensive Clinical Medicine, Vol. 2,
No. 4, 2020, pp. 383–387.
[13] Beloconi, A. and Vounatsou, P., Long-term
air pollution exposure and COVID-19 case-
severity: An analysis of individual-level data
from Switzerland, Environmental Research,
Vol. 216, 2023, 114481.
[14] Austin, W., Carattini, S., Gomez-Mahecha, J.
and Pesko, M.F., The effects of
contemporaneous air pollution on COVID-19
morbidity and mortality, Journal of
Environmental Economics and Management,
Vol. 119, 2023, 102815.
[15] Chiang Mai, Covid-19 situation in Chiang
Mai Province, (2023), [Online].
https://www.chiangmai.go.th/covid19//
(Accessed Date: October 20, 2023).
[16] Pollution Control Department, Daily PM2.5
concentration, (2023), [Online].
http://air4thai.pcd.go.th/webV3/#/History
(Accessed Date: October 20, 2023).
[17] R Core Team, R: A language and
environment for statistical computing. R
Foundation for Statistical Computing,
Vienna, Austria, 2021, [Online].
https://www.R-project.org/ (Accessed Date:
November 5, 2023).
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors equally contributed to the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
This research was funded by the National Science,
Research and Innovation Fund (NSRF), and King
Mongkut’s University of Technology North
Bangkok Contract no. KMUTNB-FF-67-B-43.
Conflict of Interest
The authors have no conflicts of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US