Development of Regression Models for COVID-19 Trends in Malaysia
SOFIANITA MUTALIB1,3*, SITI NURJEHA MOHD PUNGUT4, AIDA WATI ZAINAN ABIDIN2,
SHAMIMI A HALIM1, ISKANDAR SHAH MOHD ZAWAWI2
1School of Computing Sciences, College of Computing, Informatics and Mathematics,
Universiti Teknologi MARA,
40450 Shah Alam, Selangor,
MALAYSIA
2School of Mathematical Sciences, College of Computing, Informatics and Mathematics,
Universiti Teknologi MARA,
40450 Shah Alam, Selangor,
MALAYSIA
3Research Initiative Group Intelligent Systems,
Universiti Teknologi MARA,
40450 Shah Alam, Selangor,
MALAYSIA
4Xplode Media Private Limited,
Lot No. A-07-2 Paragon Point, Seksyen 9 Pusat Bandar Baru Bangi Bangi, 43650, Selangor,
MALAYSIA
*Corresponding Author
Abstract: - COVID-19 has emerged as the biggest threat to the world's population since December 2019. This
extraordinary event has caused fatalities, financial losses, and widespread fear, including in Malaysia.
Using COVID-19 data available from the Ministry of Health (MOH) Malaysia website, covering 25/1/2020
to 17/6/2022, this study developed regression models that describe the trends of COVID-19 cases in Malaysia,
taking into account the unpredictable nature of COVID-19 cases. Three techniques were applied in the Weka
software: Support Vector Regression (SVR), Multiple Linear Regression (MLR), and Random Forest (RF), each
evaluated with 60:40 and 70:30 split ratios and with 10-fold and 20-fold cross-validation. The findings indicate
that, with new cases among adults as input, RF has the strongest correlation coefficient and the lowest Root Mean
Square Error, 22.7611, in predicting new COVID-19 deaths in Malaysia. Future work could extend this study with
further attributes such as vaccination status and type, as well as external factors such as location.
Key-Words: - COVID-19, Regression Models, Random Forest, Support Vector Regression, Linear Regression,
Supervised.
Received: August 27, 2022. Revised: September 28, 2023. Accepted: October 7, 2023. Published: November 3, 2023.
1 Introduction
Since the first case was reported in Wuhan in 2019,
the Coronavirus Disease 2019 (COVID-19) has been
affecting the world for nearly two years, resulting in
numerous cases and unnecessary deaths. Beginning at
the start of December 2019, the disease spread
rapidly throughout Wuhan City, Hubei Province,
China, [1]. The World Health Organization declared
COVID-19 a global pandemic on March 11, 2020.
Right after the first case outside of China was
identified, Malaysia began strict screening,
prohibited the entry of foreigners, and closed its
borders. After that, Malaysia experienced new waves
of the disease until April 2020, with the first wave
completely and effectively managed and the second
wave beginning in early March 2021, when cautious
measures were taken.
The number of COVID-19 cases in Malaysia
continues to rise daily, according to data from
Malaysian states through January 7, 2022. The state
with the highest number of confirmed cases is
Selangor, with 792043, followed by Sarawak and
Johor with 252461 and 247439 cases, respectively.
Wilayah Persekutuan Labuan, Wilayah Persekutuan
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2023.20.42
Sofianita Mutalib, Siti Nurjeha Mohd Pungut,
Aida Wati Zainan Abidin, Shamimi A Halim,
Iskandar Shah Mohd Zawawi
E-ISSN: 2224-3402
398
Volume 20, 2023
Putrajaya, and Perlis have a low number of COVID-19
cases, with 10897, 9569, and 7321 confirmed
cases, respectively. The monitoring also covers the
total number of deaths broken down by state. The
highest death counts are still seen in Selangor and
Johor, with 9988 and 3876 deaths, respectively.
Sarawak had only 1618 total deaths, which is a
modest number considering that it had the
second-highest number of confirmed cases.
Based on publicly available information, this
epidemic began when many people at a Wuhan seafood
market contracted the virus. In addition to seafood,
this market sells a broad variety of unusual animals,
including birds, snakes, marmots, and bats. Because
of their diet, these animals are high-risk hosts of
several viruses and bacteria. Because of the sharp
increase in reported cases, the Malaysian government
began implementing preventive measures to halt the
virus's spread. To halt the global spread of
SARS-CoV-2, traditional measures must be put into
place and adhered to, [2]. The Malaysian government
has recommended several preventive measures to stop
the virus from spreading, including frequent hand
washing, avoiding handshakes with strangers, wearing
masks and gloves, maintaining a minimum of one metre
of social distance when outside, and avoiding crowds.
The adoption of non-pharmaceutical measures such as
social distancing and lockdowns has been shown to
significantly reduce the pandemic's scope, [3].
Forecasting the spread of the COVID-19 virus helps
health systems organize their resources, monitor
outbreak management, and prepare for pandemics.
Mathematical and statistical models are useful for
forecasting how infectious diseases may develop, and
researchers have applied several such models to
forecast the number of COVID-19 cases. The
Autoregressive Integrated Moving Average (ARIMA)
model, [4], is an example of a statistical model that
was used to forecast COVID-19 cases, including in
Italy, Spain, and France, [5]. In addition, a
polynomial model for daily COVID-19 case forecasting
was proposed, [6]. The Susceptible-Infected-Removed
(SIR) model was used to estimate and analyze the
COVID-19 spread in Kuwait with fixed variables, [7],
and to estimate the final size of the COVID-19
epidemic in Pakistan based on the reported cases, [8].
A few additional studies use the traditional
Susceptible-Exposed-Infectious-Removed (SEIR)
model, [9], [10], [11].
Our study aims to test several regression
algorithms for predicting new COVID-19 deaths in
Malaysia. Regression models can be useful in
analyzing the non-linear relations in historical
cases, as mentioned above. Therefore, three
different regression models are applied for this
prediction. The contribution of this paper is a set
of experiments in predicting the target (new deaths)
under selected scenarios: general, children,
adolescents, and adults. The developed models can be
used for monitoring the spread of the pandemic.
The rest of this paper is structured as follows:
Section 2 provides the related works, and Section 3
presents the literature about predictive models.
Section 4 introduces the methods and dataset used in
this study. The results and findings are given in
Section 5. Section 6 concludes the study.
2 Related Works
When the COVID-19 disease was identified in China,
a lot of effort was made to test many types of
treatments and solutions across multiple populations
to reduce its impact and spread. Although the first
case of the current pandemic was identified more
than two years ago, this year's immunization program
in Malaysia did not begin until early February, with
a primary focus on immunization and infection
control. Herd immunity reduces the likelihood of
effective contact between a susceptible individual
and an infected host, thereby offering indirect
protection to susceptible members of a sufficiently
immune group. In its most fundamental form, herd
immunity occurs when a population reaches the herd
immunity threshold, that is, when a certain
proportion of people are immune to a virus, [12].
The SIR model has been applied to numerous viral
pandemics, including H1N1, demonstrating that
modelers can approximate disease behavior by
estimating a small number of parameters. However, as
a mathematical method, the SIR model can be difficult
to apply to the dynamics of a real epidemic, and it
may perform poorly if the process is overly
simplified. The revised data for
the COVID-19 pandemic are complex and daily;
therefore, they must be discarded in day-series order.
The SIR model also assumes that total immunity can
be acquired through infection, thereby including the
epidemiological concept of natural herd immunity.
However, the authors noted earlier that the SIR model
could not account for such dynamics in this
pandemic, so employing artificial intelligence or
machine learning techniques would strengthen the
reliability of COVID-19 data, [13], [14]. Awareness
of the SIR model in COVID-19 can help researchers
reduce other infectious diseases as well as other
problems, such as computer viruses and natural
disasters.
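For reference, the classical SIR model discussed above is commonly written as the following system of ordinary differential equations. This is the standard textbook form and not necessarily the formulation used in, [13], or, [14]; the symbols $\beta$ (transmission rate), $\gamma$ (removal rate), and $N$ (total population) are notation assumed here.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N}, \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I, \\
\frac{dR}{dt} &= \gamma I, \qquad S + I + R = N .
\end{aligned}
```

In this form, removed individuals never return to the susceptible compartment, which corresponds to the total-immunity assumption noted above.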
The COVID-19 pandemic has had some influence
and impact on other studies as well, such as mental
health and economic sectors. Studies have used
statistical and machine learning techniques for
sentiment analysis of the MySejahtera app, which is
mandatory for all Malaysians to report their
movements, [15]. In analyzing feeds and tweets on
social media, Latent Dirichlet Allocation (LDA) was
implemented to view the three main topics and issues
related to mental health, [16]. Meanwhile, Naive
Bayes was tested based on VADER and TextBlob scores
to classify the records into positive and negative
sentiments. The next section explains the machine
learning algorithms used in developing predictive
regression models, which were adapted in our study
for COVID-19 in Malaysia.
3 Predictive Models
Predictive models are used to predict future events
in areas including economics and public health.
Predictive models are also dynamic: they are
regularly updated or verified to take into account
changes in the underlying data. They rely on
assumptions that are grounded in historical and
contemporary events. In the case of COVID-19, the
development of predictive models is demonstrated by
several research papers.
One study used IoT-based technologies and machine
learning to rapidly identify the spread of
coronavirus cases, monitor the clinical outcomes of
survivors, and collect and analyze pertinent data to
establish the presence of the virus, [17]. The
collected data culminated in an 80-symptom list, and
the study compared eight algorithms on the accuracy
of providing information or tracking COVID-19 cases
for each patient. The results show that the machine
learning algorithms achieved a good accuracy rate
that exceeded 90%, demonstrating that this method of
tracking and monitoring is effective.
Another research paper attempts to forecast whether
infected patients will recover or not (released or
deceased) using machine learning algorithms, [18].
The variables in the dataset include gender, age,
infection case, and number of days. The Decision
Tree (DT) model was found to be the most accurate
with a 99.85% accuracy rate, followed by Random
Forest (RF) with 99.60%, Support Vector Machine
(SVM) with 98.85%, K-nearest neighbor (KNN) with
98.06%, Naive Bayes (NB) with 97.52%, and Linear
Regression (LR) with 97.49%. The developed models
would be of great assistance in the healthcare
industry's battle against COVID-19. Table 1 shows
more research papers and the algorithms applied in
COVID-19 studies.
Table 1. Machine learning algorithms used in building predictive models.
Reference | Description | Techniques and results
[19] | A prediction model based on demographic and clinical features, with target: positive or negative. | Logistic Regression, DT, SVM, NB. Best result: SVM (accuracy 93.34%).
[17] | A framework with IoT devices to monitor and track survivors' clinical measures (80 symptoms). | Neural Network, Decision Table, SVM, NB, k-NN, Dense Neural Network (DNN): more than 90%. LSTM and OneR: less than 90%.
[20] | Target: number of active cases, deaths, and recoveries. | LASSO, RF, DT Regressor, LR, SVM, Polynomial Regression. The performance of the algorithms varies.
[21] | 20 attributes that are possible factors related to acquiring the virus, to predict whether it is positive or not. | J48 DT, RF, SVM, k-NN, NB, MLP, LR, ANN. SVM is the best model; RF is the second-best model.
3.1 Multiple Linear Regression
Multiple regression, or multiple linear regression
(MLR), is a statistical technique that combines
multiple explanatory variables to predict the outcome
of a response variable. MLR establishes a linear
mathematical relationship between these variables.
One study focused on estimating solar radiation using
multiple linear regression techniques, artificial
neural networks, and empirical equations such as the
Hargreaves equation, [22]. Meanwhile, another study
compared the predictability of the monthly streamflow
using MLR, ANN, ANFIS, and KNN, [23]. The results
demonstrated that all three nonlinear models, ANN,
ANFIS, and KNN, performed well, with the ANFIS model
outperforming the others due to its utilization of
both fuzzy inference systems and neural networks.
MLR models have a different advantage in model
development, as they are designed to replicate a
linear relationship between inputs and outputs; in
that study, however, MLR failed to accurately
forecast monthly flows.
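As a minimal sketch of how MLR coefficients can be estimated, the following pure-Python example solves the least-squares normal equations for any number of predictors. The function names (`fit_mlr`, `predict_mlr`, `solve_linear_system`) and the toy data are illustrative assumptions; they are not the Weka implementation used in this study.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def fit_mlr(rows, targets):
    """Return [b0, b1, ..., bp] minimizing squared error for y = b0 + b1*x1 + ... + bp*xp."""
    design = [[1.0] + list(map(float, row)) for row in rows]  # intercept column first
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_mlr(beta, row):
    """Evaluate the fitted linear model for one record."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Toy data generated exactly from y = 2 + 3*x1 - 1*x2 (illustrative, not MOH data).
toy_rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
toy_targets = [2 + 3 * x1 - x2 for x1, x2 in toy_rows]
toy_beta = fit_mlr(toy_rows, toy_targets)
```

Because the toy targets are exactly linear in the two predictors, the fitted coefficients recover the generating values up to floating-point error; on real data the fit would only approximate the relationship.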
3.2 Random Forest
The output of the RF algorithm for classification
problems is the class selected by the majority of trees.
One study, [24], used the RF algorithm to assess and
issue warnings regarding the security risk associated
with large-scale group activities, conducting
experiments with a 10-fold cross-validation method.
In that study, RF was compared with the KNN, NB, and
CART algorithms. The results revealed that RF
achieved the highest accuracy of 86%, while KNN
achieved 81%, NB achieved 74%, and the CART algorithm
achieved the lowest accuracy of the four algorithms,
indicating that RF is a reliable method for
prediction, [25].
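The aggregation step described at the start of this subsection can be sketched as follows. For classification, the forest returns the majority class; for regression, which is the setting of our experiments, RF conventionally averages the numeric outputs of the trees. The function names and the five per-tree outputs below are hypothetical and only illustrate the aggregation, not the training of the trees.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the numeric predictions made by the trees."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for a single input record.
votes = ["high risk", "low risk", "high risk", "high risk", "low risk"]
estimates = [20.0, 24.0, 22.0, 26.0, 23.0]

majority_label = aggregate_classification(votes)   # class chosen by most trees
average_estimate = aggregate_regression(estimates)  # mean of the tree estimates
```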
Another study demonstrates that RF models are
superior to decision trees, as the accuracy of random
forest models for both the train and test sets was
slightly higher than that of decision tree models, [26].
The purpose of that study was to evaluate and
compare, on a regional scale, the performance of two
machine learning models, DT and RF, with respect to
seven modeled major rainfall-triggered landslides on
the Japanese island of Izu-Oshima. The samples in
that investigation were chosen based on the presence
or absence of landslide data, and a classification
tree was constructed. At each branch node, a random
subset of the potential cause factors is selected.
3.3 Support Vector Regression
Support Vector Regression (SVR), a supervised
learning method, is used to forecast continuous
values. SVR and SVM are based on a similar concept.
The primary objective of SVR is to locate the
best-fit line, which is the hyperplane that has the
largest number of points within its margin. One
research paper showed that SVR can be combined with
other techniques. The purpose of that study was to
identify the best model for predicting the average
monthly temperature in Iran and to demonstrate the
effectiveness of combining SVR with the Firefly
optimization algorithm, [27].
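The idea of a fit line that keeps as many points as possible close to it is usually formalized in SVR through an epsilon-insensitive loss: residuals smaller than a tolerance epsilon cost nothing, and larger residuals are penalized linearly. The sketch below shows only this loss, under the assumption of a fixed tolerance `epsilon`; the function names and numbers are illustrative and do not reproduce the Weka configuration in Figure 7.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Loss is zero inside the tube |y - f(x)| <= epsilon and linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def count_points_in_tube(targets, predictions, epsilon):
    """Count observations whose residual lies within the epsilon-tube."""
    return sum(1 for y, f in zip(targets, predictions) if abs(y - f) <= epsilon)


# Illustrative values only (not taken from the study's dataset).
targets = [10.0, 12.0, 15.0, 30.0]
predictions = [11.0, 12.5, 14.0, 22.0]

losses = [
    epsilon_insensitive_loss(y, f, epsilon=2.0)
    for y, f in zip(targets, predictions)
]
inside = count_points_in_tube(targets, predictions, epsilon=2.0)
```

With a tolerance of 2.0, the first three points fall inside the tube and contribute no loss, while the last point, with a residual of 8.0, contributes a loss of 6.0.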
In another study, SVR and genetic algorithm
(GA) are combined to predict the water temperature
in numerous reservoirs, [28]. This study employed the
root-mean-square error (RMSE), mean absolute error
(MAE), mean absolute percentage error (MAPE), and
Nash-Sutcliffe efficiency coefficient (NSE) to
compare the accuracy of these strategies. The results
demonstrate that the GA-SVR model outperforms the
SVR model on all metrics, while the M-GASVR
model is superior, and the ANN model is the least
effective of the four models.
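For readers unfamiliar with the metrics named above, the following sketch computes MAE, MAPE, and NSE using their standard definitions (RMSE is defined in Section 4.3). The function names and sample values are illustrative assumptions and are not data from, [28].

```python
def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (undefined if an observed value is 0)."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(observed)


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of the observations."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / total


# Illustrative values only.
observed = [10.0, 20.0, 40.0, 50.0]
predicted = [12.0, 18.0, 44.0, 45.0]

mae_value = mae(observed, predicted)
mape_value = mape(observed, predicted)
nse_value = nse(observed, predicted)
```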
4 Methodology
4.1 Data Acquisition
In this phase, our study processes the available data
and draws on previous research as the starting point
for modeling the COVID-19 trend. The daily data was
obtained from the website of the Malaysian Ministry
of Health (MOH), [29]. Based on the literature,
patterns that previous studies identified as possibly
influencing COVID-19 fatalities were used as
variables to assess the accuracy of each model.
Collecting COVID-19 datasets was one of the
activities performed during this phase. These
datasets contain actual information regarding
Malaysia's daily cases, daily deaths, and population,
and capture the records from 25/1/2020 to 17/6/2022.
To view the daily trends, plotted graphs are provided
in Figure 1 for the daily new cases of COVID-19,
Figure 2 for the daily new deaths, Figure 3 for the
number of vaccination recipients, and Figure 4 for
the average number of daily new cases by category.
Fig. 1: The daily new cases of COVID-19
Fig. 2: The number of daily new deaths
[Plot for Fig. 2: deaths_new (No. of new deaths, axis 0 to 700) by date, 1/25/2020 to 5/25/2022]
Fig. 3: The number of vaccination recipients
Fig. 4: The average of daily new cases based on
category
4.2 Data Modeling
The attributes of daily cases, immunization, and the
groups of children, adolescents, and adults were
selected and converted into a format compatible with
Weka. This study's objective is to determine which
trends could have an impact on the number of deaths
in Malaysia, so deaths_new serves as $y$ and the other
trends serve as $x$ in a series of experiments based on
Linear Regression (LR):
$$y = Ax + B. \tag{1}$$
Equation (1) measures the relationship between the
independent variable ($x$) and the dependent variable
($y$), where $A$ and $B$ are constants that can be
estimated by fitting the experimental data on the
variables through the method of least squares.
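As a minimal sketch of the least-squares estimation mentioned above, the slope and intercept of Equation (1) have the closed form A = S_xy / S_xx and B = mean(y) - A * mean(x). The function name `fit_line` and the sample numbers below are hypothetical; they are not values from the MOH dataset.

```python
def fit_line(xs, ys):
    """Return (A, B) minimizing the sum of squared errors for y = A*x + B."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered cross-product and sum of squares.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily new cases (x) and daily new deaths (y), lying on y = 0.02*x + 1.
daily_cases = [100, 200, 300, 400, 500]
daily_deaths = [3, 5, 7, 9, 11]
A, B = fit_line(daily_cases, daily_deaths)
```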
Next, the relationship between the input and output
variables was tested using LR, RF, and SVR, with the
configurations shown in Figure 5, Figure 6, and
Figure 7, respectively. Training and testing were
done using the hold-out method with ratios of 60:40
and 70:30, and the cross-validation (CV) method with
10 and 20 folds.
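The two validation schemes can be sketched as index partitions, as shown below. This is a simplified illustration that keeps the records in their original order; Weka may randomize the data before splitting, so the exact partitions used in our experiments may differ. The function names and the record count of 100 are assumptions made for the example.

```python
def holdout_split(n_records, train_fraction):
    """Return (train_indices, test_indices), training on the first records."""
    cut = round(n_records * train_fraction)
    indices = list(range(n_records))
    return indices[:cut], indices[cut:]


def kfold_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of k nearly equal folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Illustrative record count, not the size of the study's dataset.
train_60, test_40 = holdout_split(100, 0.6)
folds_10 = list(kfold_splits(100, 10))
folds_20 = list(kfold_splits(100, 20))
```

In k-fold cross-validation every record is used for testing exactly once, whereas a hold-out split tests on a single fixed portion of the data; this is one reason the two schemes can give different error estimates.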
Fig. 5: Configuration for MLR model
Fig. 6: Configuration for RF model
[Plot for Fig. 3: daily_vax (No. of daily vaccination, axis 0 to 700000) by date, 1/25/2020 to 5/25/2022]
[Plot for Fig. 4: average number of daily new cases based on category (cases_new, cases_child, cases_adolescent, cases_adult); bar labels as extracted: 5.185, 685, 332, 3.605]
Fig. 7: Configuration for SVR model
4.3 Model Evaluation
During the evaluation phase, the performance of the
algorithms was compared based on how each
algorithm satisfied the established criteria. In
addition, the researcher must ensure that the
procedures are followed precisely to construct the
most accurate model. This study concentrates on
RMSE and Karl Pearson's correlation coefficient to
evaluate the performance of each regression model.
The RMSE is the square root of the mean squared
error (MSE). For a sample of $n$ observations $y$
($y_i$, $i = 1, 2, \ldots, n$) and $n$ corresponding
model predictions $\hat{y}$, the RMSE is given as
follows, [30]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \tag{2}$$
Correlation is a technique that measures the
nature, degree, and extent of association existing
between two continuous variables. Karl Pearson's
correlation coefficient is a measure of the degree of
relationship between two variables $x$ and $y$, which
is expressed as follows, [31]:
$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}, \tag{3}$$
where $n$ is the number of observations. The values
of $r$ are between -1 and 1; a positive value
indicates a positive relationship, while a negative
value indicates a negative association between the
variables. A value of $r = 0$ indicates a negligible
relationship between the variables. The correlation
is ‘moderate’ for $0.5 \le r \le 0.7$ and ‘strong’ for
$r > 0.7$. A correlation value of $r < 0.5$ implies a
weak correlation.
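Equations (2) and (3) can be sketched directly in code, together with the interpretation thresholds stated above. The function names and sample values are illustrative assumptions; in our experiments these measures were reported by Weka.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, following Equation (2)."""
    n = len(observed)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n)


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, following Equation (3)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


def describe_strength(r):
    """Label a correlation value using the thresholds given in this section."""
    if r > 0.7:
        return "strong"
    if r >= 0.5:
        return "moderate"
    return "weak"


# Illustrative values only.
error = rmse([3, 5, 7, 9], [2, 5, 8, 11])      # squared errors 1, 0, 1, 4
correlation = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear pair
```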
5 Results and Findings
Five experiments are reported for the prediction
model of new deaths using regression methods, based
on the input attributes shown in Table 2. The
presented graphs are based on 10-fold
cross-validation, 20-fold cross-validation, and split
percentages of 60:40 and 70:30 with different
algorithms. The results of each model are then
compared to find the smallest RMSE value among
them.
Table 2. The five experiments of regression models.
Experiment | Input variable, x | Outcome (target), y
1 | x = daily new cases | y = number of new deaths
2 | x1 = number of immunizations; x2 = daily new cases | y = number of new deaths
3 | x1 = number of immunizations to children; x2 = daily new cases among children | y = number of new deaths
4 | x1 = number of immunization receivers; x2 = new patient among adolescent | y = number of new deaths
5 | x1 = number of immunization receivers; x2 = new patient among adult | y = number of new deaths
5.1 Experiment 1: To Predict the New Death
Number
In Experiment 1, our study applied simple linear
regression (SLR), RF, and SVR to model the COVID-19
trends, with cases_new as the independent variable x
used to predict deaths_new, which holds the value of
y in LR. Figure 8 shows that RF has the lowest RMSE
of the three models, with 54.9854 at 10-fold
cross-validation and 54.8599 at 20-fold
cross-validation.
Fig. 8: The comparison of RMSE for the developed
models in Experiment 1.
5.2 Experiment 2: To Predict the New Death
Number
In this experiment, another variable, which is the
number of immunizations, was combined with the
daily number of new cases applied in the model to
forecast the occurrence of new deaths. With 20-fold
cross-validation, RF produces the lowest RMSE for
the second experiment with 27.1664, as shown in
Figure 9.
Fig. 9: The comparison of RMSE for the developed
models in Experiment 2.
5.3 Experiment 3: To Examine the Influence
of Immunization and the New Cases
among Children on the New Deaths
Experiment 3 examined the influence of the number of
immunizations given to children and the number of
new COVID-19 cases among children on new deaths.
From Figure 10, RF has the lowest RMSE, with 27.4661
and 25.7022 at 10-fold and 20-fold cross-validation,
respectively.
Fig. 10: The comparison of RMSE for the developed
models in Experiment 3.
5.4 Experiment 4: To Examine the Influence
of Immunization and the New Cases
among Adolescents on the New Deaths
The fourth experiment concentrated on new COVID-19
cases among adolescents and the number of
immunizations they had received to predict the
deaths. Based on Figure 11, RF was the
best-performing regression model in this experiment,
with RMSE values of 32.7991 and 32.1197 at 10-fold
and 20-fold cross-validation, respectively.
Fig. 11: The comparison of RMSE for the developed
models in Experiment 4.
5.5 Experiment 5: To Examine the Influence
of Immunization and the New Cases
among Adults on the New Deaths
In this experiment, the number of immunizations and
the new cases among adults are the inputs. Based on
Figure 12, RF showed the lowest RMSE, with 22.8123
and 22.7611 at 10-fold and 20-fold cross-validation,
respectively.
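The comparison reported for Experiment 5 can be checked with a short script over the RMSE values shown in the Experiment 5 chart (Figure 12). Assigning each run of four chart values to SLR, RF, and SVR in legend order, and to the four validation setups in axis order, is an assumption made here; the RF values agree with those quoted in the text. The function name `best_configuration` is illustrative.

```python
SETUPS = ["10-fold CV", "20-fold CV", "split 60:40", "split 70:30"]

# RMSE values read from the Experiment 5 chart; the mapping of values to
# models and setups follows the chart legend and axis order (an assumption).
experiment5_rmse = {
    "SLR": [38.749, 38.5013, 48.8, 53.7363],
    "RF": [22.8123, 22.7611, 34.3059, 36.1779],
    "SVR": [40.8272, 40.6387, 54.4353, 58.7298],
}


def best_configuration(rmse_by_model, setups):
    """Return (rmse, model, setup) for the smallest RMSE across all runs."""
    candidates = [
        (value, model, setup)
        for model, values in rmse_by_model.items()
        for setup, value in zip(setups, values)
    ]
    return min(candidates)


best_rmse, best_model, best_setup = best_configuration(experiment5_rmse, SETUPS)
```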
[Chart data for Figs. 8 to 11: RMSE by model, with values listed in the order K-fold = 10, K-fold = 20, split 60:40, split 70:30]
Experiment 1 (Fig. 8): SLR 58.4176, 58.3323, 70.6572, 74.5921; RF 54.9854, 54.8599, 57.1381, 59.0701; SVR 58.9269, 58.7001, 70.1668, 74.4338.
Experiment 2 (Fig. 9): SLR 46.0016, 45.7999, 57.8904, 63.4693; RF 27.1744, 27.1664, 34.1673, 35.8605; SVR 47.5762, 47.5674, 62.0599, 67.8216.
Experiment 3 (Fig. 10): SLR 40.5135, 40.2828, 50.6569, 54.3088; RF 27.4661, 25.7022, 35.4806, 39.198; SVR 42.1664, 41.9965, 55.0881, 58.7294.
Experiment 4 (Fig. 11): SLR 50.1875, 50.0277, 60.2056 (one value missing from the extracted chart); RF 32.7991, 32.1197, 44.3423, 44.8022; SVR 50.7147, 50.5027, 60.9837, 64.24.
Fig. 12: The comparison of RMSE for the developed
models in Experiment 5.
In comparing the RF results among the five
experiments, the lowest RMSE is obtained when
predicting deaths based on immunization and new
cases among adults. New cases among adolescents and
children give somewhat higher RMSE scores, which
suggests a weaker influence on the prediction.
Meanwhile, Figure 13 shows the average correlation
coefficient among the regression models in the five
experiments for the four training and testing
setups. In Figure 13, SLR/MLR and SVR have a lower
median line than RF. The distribution of the
correlation coefficient for RF is shifted upwards
relative to SLR/MLR and SVR. The upper quartile for
SLR/MLR and SVR overlaps with the lower quartile for
RF. The variability of the correlation coefficient is
similar across the algorithms, and the distributions
are negatively skewed because the median is closer to
the upper value. This means that the majority of the
correlation values are ‘moderate’ or ‘strong’, that
is, 0.5 to 0.7 or more than 0.7, respectively. By
comparison, the studies in, [17], [18], and, [19],
used supervised classification methods to predict a
level of risk or positive and negative class labels,
so their results are reported as accuracy.
The main limitation of this analysis is that it
uses the daily data without considering the exact
location within Malaysia, although some areas
contribute a higher number of cases and a faster rate
of spread. Despite this limitation, the biggest
strength of this study is the number of experiments
conducted, covering the general population as well as
adults, children, and adolescents. In addition, the
training and testing were carried out under several
cross-validation and split settings to confirm the
performance of the regression models.
Fig. 13: The box plot of the average correlation
coefficient for the developed models.
6 Conclusion
The COVID-19 attributes that affect death cases in
Malaysia have been explored. The goal of our study
was to perform experiments in developing a prediction
model that will be more significant in the future.
The best prediction technique for modeling COVID-19
trends was then identified. After several
experiments with each modeling technique, the RF
regression model achieved the lowest RMSE in all
experiments, with the overall lowest RMSE of 22.7611
in Experiment 5 at 20-fold cross-validation. The
performance of RF is also maintained in predicting
the occurrence of fatalities.
There are opportunities to enhance research in the
creation of regression models in the future. Studies
can be conducted to identify potential future trends
that may impact fatalities in this or other
pandemics. A new experiment incorporating the
variable of vaccination type can also be conducted
during the current pandemic. The dataset will soon
have over 1,000 data points per attribute, which may
increase the experiment's accuracy. At present, the
modeling experiments were done with a dataset that
has not exceeded 1,000 data points, so the study's
complexity has not been fully realized. Moreover, the
developed model in this study can be expanded to
build an application with the following advantages:
for individuals, to identify the possibility of
COVID-19 based on the symptoms appearing;
for organizations, to predict whether the risk
becomes higher or lower based on the importance of
locations;
for the government, to monitor the risk of
spread.
[Chart data for Fig. 12: RMSE by model, with values listed in the order K-fold = 10, K-fold = 20, split 60:40, split 70:30]
Experiment 5 (Fig. 12): SLR 38.749, 38.5013, 48.8, 53.7363; RF 22.8123, 22.7611, 34.3059, 36.1779; SVR 40.8272, 40.6387, 54.4353, 58.7298.
Research incorporating further information or
symptoms from hospital records, infected
individuals, COVID-19 survivors, and patients
undergoing examination or treatment may also be
taken into consideration. To give more details about
the required actions and potential therapies, a model
that can determine the likely severity of COVID-19
can also be constructed.
Acknowledgments:
The authors would like to express their gratitude to
the Research Management Center, Universiti
Teknologi MARA, Malaysia for the research fund
600-RMC/GPM ST 5/3 (021/2021) and College of
Computing, Informatics and Mathematics, Universiti
Teknologi MARA, Shah Alam, Selangor, Malaysia
for the research support. The authors also thank
Muhammad Danish Hazim Bin Rahmad and
Muhammad Umarul Aiman Bin Mohamad Zulhilmi
for their assistance.
References:
[1] A. U. M. Shah, S. N. A. Safri, R. Thevadas, N.
K. Noordin, A. A. Rahman, Z. Sekawi, A.
Ideris, and M. T. H. Sultan, "COVID-19
outbreak in Malaysia: Actions taken by the
Malaysian government," Int. J. Infect. Dis., vol.
97, pp. 108-116, 2020.
https://doi.org/10.1016/j.ijid.2020.05.093
[2] A. Elengoe, "COVID-19 Outbreak in
Malaysia," Osong Public Health Res. Perspect.,
vol. 11, no. 3, pp. 93-100, 2020.
https://doi.org/10.24171/j.phrp.2020.11.3.08
[3] S. Moore, E. M. Hill, L. Dyson, M. J. Tildesley,
and M. J. Keeling, "Modelling optimal
vaccination strategy for SARS-CoV-2 in the
UK," PLoS Comput. Biol., vol. 17, no. 5, pp.
e1008849, 2021.
https://doi.org/10.1371/journal.pcbi.1008849
[4] S. Singh, B. M. Sundram, K. Rajendran, K. B.
Law, T. Aris, H. Ibrahim, S. C. Dass, and B. S.
Gill, "Forecasting daily confirmed COVID-19
cases in Malaysia using ARIMA models," The
Journal of Infection in Developing Countries,
vol. 14, no. 09, pp. 971-976, 2020.
https://doi.org/10.3855/jidc.13116
[5] S. Singh, B. M. Sundram, K. Rajendran, K. B.
Law, T. Aris, H. Ibrahim, S. C. Dass, and B. S.
Gill, "Forecasting daily confirmed COVID-19
cases in Malaysia using ARIMA models," The
Journal of Infection in Developing Countries,
vol. 14, no. 09, pp. 971-976, 2020.
https://doi.org/10.3855/jidc.13116
[6] M. Ekum and A. Ogunsanya, "Application of
hierarchical polynomial regression models to
predict transmission of COVID-19 at global
level," Int J Clin Biostat Biom, vol. 6, no. 1, pp.
27, 2020.
[7] M. N. Alenezi, F. S. Al-Anzi, and H.
Alabdulrazzaq, "Building a sensible SIR
estimation model for COVID-19 outspread in
Kuwait," Alexandria Engineering Journal, vol.
60, no. 3, pp. 3161-3175, 2021.
https://doi.org/10.1016/j.aej.2021.01.025
[8] F. Syed and S. Sibgatullah, "Estimation of the
Final Size of the COVID-19 Epidemic in
Pakistan,"
https://doi.org/10.1101/2020.04.01.20050369
[9] F. Nyabadza, F. Chirove, W. Chukwu, and M.
V. Visaya, "Modelling the potential impact of
social distancing on the COVID-19 epidemic in
South Africa,"
https://doi.org/10.1101/2020.04.21.20074492
[10] H. B. Taboe, K. V. Salako, J. M. Tison, C. N.
Ngonghala, and R. G. Kakaï, "Predicting
COVID-19 spread in the face of control
measures in West Africa," Mathematical
Biosciences, vol. 328, p. 108431, 2020.
https://doi.org/10.1016/j.mbs.2020.108431
[11] C. Wang, L. Liu, X. Hao, H. Guo, Q. Wang, J.
Huang, N. He, H. Yu, X. Lin, A. Pan, S. Wei,
and T. Wu, "Evolving Epidemiology and
Impact of Non-pharmaceutical Interventions on
the Outbreak of Coronavirus Disease 2019 in
Wuhan, China,"
https://doi.org/10.1101/2020.03.03.20030593
[12] H. E. Randolph and L. B. Barreiro, "Herd
Immunity: Understanding COVID-19,"
Immunity, vol. 52, no. 5, pp. 737-741, 2020.
https://doi.org/10.1016/j.immuni.2020.04.012
[13] K. B. Law, M. P. K, H. Mohd Ibrahim, and N.
H. Abdullah, "Modelling infectious diseases
with herd immunity in a randomly mixed
population," Sci. Rep., vol. 11, no. 1, p.
20574, 2021.
https://doi.org/10.1038/s41598-021-00013-2
[14] K. M. A. Kabir, K. Kuga, and J. Tanimoto,
"Analysis of SIR epidemic model with
information spreading of awareness," Chaos,
Solitons & Fractals, vol. 119, pp. 118-125,
2019.
https://doi.org/10.1016/j.chaos.2018.12.017
[15] P. A. R. Azmi, A. W. Z. Abidin, S. Mutalib, I.
S. M. Zawawi, and S. A. Halim, "Sentiment
Analysis on MySejahtera Application during
COVID-19 Pandemic," 2022 3rd International
Conference on Artificial Intelligence and Data
Sciences (AiDAS), Ipoh, Malaysia, 2022, pp.
215-220.
https://doi.org/10.1109/AiDAS56890.2022.9918748
[16] N. Khalid, S. Abdul-Rahman, W. Wibowo, N.
S. Abdullah, and S. Mutalib, "Leveraging
social media data using latent Dirichlet
allocation and naïve Bayes for mental health
sentiment analytics on Covid-19
pandemic," International Journal of Advances
in Intelligent Informatics, vol. 9, no. 3, pp.
457-471, 2023.
https://doi.org/10.26555/ijain.v9i3.1367
[17] A. Aljumah, "Assessment of Machine Learning
Techniques in IoT-Based Architecture for the
Monitoring and Prediction of COVID-19,"
Electronics, vol. 10, no. 15, p. 1834, 2021.
https://doi.org/10.3390/electronics10151834
[18] L. J. Muhammad, M. M. Islam, S. S. Usman,
and S. I. Ayon, "Predictive Data Mining
Models for Novel Coronavirus (COVID-19)
Infected Patients' Recovery," SN Comput. Sci.,
vol. 1, no. 4, pp. 206, 2020.
https://doi.org/10.1007/s42979-020-00216-w
[19] L. J. Muhammad, E. A. Algehyne, S. S. Usman,
A. Ahmad, C. Chakraborty, and I. A.
Mohammed, "Supervised Machine Learning
Models for Prediction of COVID-19 Infection
using Epidemiology Dataset," SN Comput. Sci.,
vol. 2, no. 1, p. 11, 2021.
https://doi.org/10.1007/s42979-020-00394-7
[20] V. Bhadana, A. S. Jalal, and P. Pathak, "A
Comparative Study of Machine Learning
Models for COVID-19 prediction in India,"
2020 IEEE 4th Conference on Information
Communication Technology (CICT), pp. 1-7,
2020.
[21] C. N. Villavicencio, J. J. E. Macrohon, X. A.
Inbaraj, J.-H. Jeng, and J.-G. Hsieh, "Covid-19
Prediction Applying Supervised Machine
Learning Algorithms with Comparative
Analysis Using WEKA," Algorithms, vol. 14,
p. 201, 2021.
https://doi.org/10.3390/a14070201
[22] V. Z. Antonopoulos, D. M. Papamichail, V. G.
Aschonitis, and A. V. Antonopoulos, "Solar
radiation estimation methods using ANN and
empirical models," Comput. Electron. Agric.,
vol. 160, pp. 160-167, 2019.
https://doi.org/10.1016/j.compag.2019.03.022
[23] A. Khazaee Poul, M. Shourian, and H.
Ebrahimi, "A Comparative Study of MLR,
KNN, ANN and ANFIS Models with Wavelet
Transform in Monthly Stream Flow
Prediction," Water Resour. Manag., vol. 33, no.
8, pp. 2907-2923, 2019.
https://doi.org/10.1007/s11269-019-02273-0
[24] Y. Chen, W. Zheng, W. Li, and Y. Huang,
"Large group activity security risk assessment
and risk early warning based on random forest
algorithm," Pattern Recognition Letters, vol.
144, pp. 1-5, 2021.
https://doi.org/10.1016/j.patrec.2021.01.008
[25] Y.-C. Chen, P.-E. Lu, C.-S. Chang, and T.-H.
Liu, "A Time-Dependent SIR Model for
COVID-19 With Undetectable Infected
Persons," IEEE Trans. Netw. Sci. Eng., vol. 7,
no. 4, pp. 3279-3294, 2020.
https://doi.org/10.1109/TNSE.2020.3024723
[26] J. Dou, A. P. Yunus, D. Tien Bui, A. Merghadi,
M. Sahana, Z. Zhu, C. W. Chen, K. Khosravi,
Y. Yang, and B. T. Pham, "Assessment of
advanced random forest and decision tree
algorithms for modeling rainfall-induced
landslide susceptibility in the Izu-Oshima
Volcanic Island, Japan," Sci. Total Environ.,
vol. 662, pp. 332-346, 2019.
https://doi.org/10.1016/j.scitotenv.2019.01.221
[27] P. Aghelpour, B. Mohammadi, and S. M.
Biazar, "Long-term monthly average
temperature forecasting in some climate types
of Iran, using the models SARIMA, SVR, and
SVR-FA," Theoretical and Applied
Climatology, vol. 138, no. 3-4, pp. 1471-1480,
2019.
https://doi.org/10.1007/s00704-019-02905-w
[28] Q. Quan, Z. Hao, H. Xifeng, and L. Jingchun,
"Research on water temperature prediction
based on improved support vector regression,"
Neural Comput. Appl., vol. 34, no. 11, pp.
8501-8510, 2020.
https://doi.org/10.1007/s00521-020-04836-4
[29] Ministry of Health Malaysia, "Official data -
COVID-19," 2022. [Online].
https://github.com/MoH-Malaysia/covid19-public
(Accessed Date: October 31, 2023)
[30] T. O. Hodson, "Root-mean-square error
(RMSE) or mean absolute error (MAE): when
to use them or not," Geoscientific Model
Development, vol. 15, no. 14, pp. 5481-5487,
2022.
https://doi.org/10.5194/gmd-15-5481-2022
[31] A. A. Suleiman, U. A. Abdullahi, A. Suleiman,
S. A. Suleiman, and H. U. Abubakar,
"Correlation and Regression Model for
Physicochemical Quality of Groundwater in the
Jaen District of Kano State, Nigeria," Journal of
Statistical Modeling and Analytics, vol. 4,
no. 1, 2022.
https://doi.org/10.22452/josma.vol4no1.2
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors contributed equally to the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
The authors would like to express their gratitude to
the Research Management Center, Universiti
Teknologi MARA, Malaysia for the research fund
600-RMC/GPM ST 5/3 (021/2021).
Conflict of Interest
The authors have no conflicts of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US