
between the 72 conditional features in the original
dataset, and some data are missing, as is common in
real-world data. Therefore, it has been very
difficult to create an accurate classification model
for ozone days until now.
In this paper, PCA is applied first to address the
high dimensionality and multicollinearity of the
dataset. As a result, the 72 attributes are reduced
to 20 attributes, which yields a slightly better
random forest model. Still, more data were needed,
especially for the minority class. For this purpose,
SMOTE, one of the representative oversampling
methods, was applied repeatedly at very high rates
to find a sample size large enough for better
classification models.
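The interpolation idea behind SMOTE can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `smote`, the parameters `rate` (synthetic points generated per minority sample), `k`, and `seed`, and the toy `minority` data are all hypothetical.

```python
import math
import random


def smote(minority, rate, k=5, seed=0):
    """Generate `rate` synthetic points per minority sample.

    Each synthetic point lies on the line segment between a minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (Euclidean), excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(rate):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position along the segment x -> nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy minority class with two features (illustrative values only)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
synthetic = smote(minority, rate=10, k=3)
print(len(synthetic), "synthetic samples from", len(minority), "originals")
```

Because every synthetic point is an interpolation between two existing minority samples, a very high `rate` adds many instances without leaving the region already covered by the minority class.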
Since supplying more high-quality training
instances increases the probability of obtaining a
machine learning model with higher accuracy, it was
also checked whether a better random forest model
can be obtained by applying oversampling at the
same very high rate to each class, generating far
more synthetic data than original data, and training
the random forests on it. However, because synthetic
data produced by oversampling differs from the
original data, the synthetic data is compared with
the original data using boxplots and t-tests to see
whether it is statistically reliable. As shown in
Table 32 and Table 33, the experiments showed that,
at a relatively high oversampling rate, the
suggested method can generate synthetic data with
statistical characteristics similar to those of the
original data, provided that statistical tests
support this. Training the random forests on such
data can yield higher accuracy than using the
original data alone, from 94% to 100%. Note that the
random forests generated from the original data
alone cannot classify ozone days at all, as Table 2
shows.
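The per-feature comparison of original and synthetic data can be sketched with a two-sample t statistic. The sketch below assumes Welch's unequal-variance form, which may differ from the exact test used in the experiments; the function name `welch_t` and the sample values are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error contributed by each sample
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# One feature, measured in original and synthetic data (illustrative values)
original = [2.1, 2.4, 1.9, 2.8, 2.3, 2.6]
synthetic = [2.2, 2.3, 2.0, 2.7, 2.4, 2.5, 2.1, 2.6]
t, df = welch_t(original, synthetic)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A small |t| relative to the critical value of the t distribution with `df` degrees of freedom gives no evidence that the feature means differ. In practice a statistics library such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) would also report the p-value.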
Conventionally, oversampled data has been used to
train machine learning models without statistical
analysis, which leaves open the question of how
closely it resembles the original data and thus
whether it can be used to improve machine learning
models. Applying a similar approach to wine data in
the UCI machine learning repository also showed that
the approach was very useful in generating a very
accurate random forest knowledge model [20].
Therefore, this paper is significant in that it
shows how a statistical methodology can be applied
to test how reliable oversampled data is, and that
this method can be effective in dealing with class
imbalance problems. Note that the ozone day data
were collected every day for 7 years, so future
research will build a knowledge model that predicts,
from the time-series data, whether an ozone day
will occur within a few days.
Acknowledgment:
This work was supported by Dongseo University,
“Dongseo Frontier Project” Research Fund of 2024.
References:
[1] K. Zhang, W. Fan, X. Yuan, I. Davidson,
Forecasting Skewed Biased Stochastic
Ozone Days: Analyses and Solutions,
Knowledge and Information Systems, Vol.
14, 2008, pp. 299-326.
https://doi.org/10.1007/s10115-007-0095-1.
[2] W. Jia, M. Sun, J. Lian, Feature
dimensionality reduction: a review, Complex &
Intelligent Systems, Vol. 8, 2022, pp. 2663-2693.
https://doi.org/10.1007/s40747-021-00637-x.
[3] S. Saha, S. Bhattacharya, A Survey:
Principle Component Analysis (PCA),
International Journal of Advanced Research
in Science and Engineering, Vol. 6, Issue 6,
2017, pp. 312-320.
[4] Y. Wei, Y. Tang, P.D. McNicholas, Flexible
High-Dimensional Unsupervised Learning
with Missing Data, IEEE Transactions on
Pattern Analysis and Machine Intelligence,
Vol. 42, No. 3, 2020, pp. 610-621.
[5] A. Sarkar, S.S. Ray, A. Prasad, C. Pradhan,
A Novel Detection Approach of Ground
Level Ozone using Machine Learning
Classifiers, 2021 Fifth International
Conference on I-SMAC (IoT in Social,
Mobile, Analytics and Cloud), Palladam,
India, 11-13 November 2021, DOI:
10.1109/I-SMAC52330.2021.9640852.
[6] V. Laveglia, E. Trentin, Downward-Growing
Neural Networks, Entropy, Vol. 25, No. 5,
733, 2023.
https://doi.org/10.3390/e25050733.
[7] J. Shlens, A Tutorial on Principal
Component Analysis, arXiv:1404.1100,
https://doi.org/10.48550/arXiv.1404.1100.
(Accessed Date: June 24, 2024).
[8] H. Almuallim, S. Kaneda, Y. Akiba,
Development and Applications of Decision
Trees, Expert Systems, edited by C.T.
WSEAS TRANSACTIONS on ENVIRONMENT and DEVELOPMENT
DOI: 10.37394/232015.2024.20.81