WSEAS Transactions on Environment and Development
Print ISSN: 1790-5079, E-ISSN: 2224-3496
Volume 20, 2024
Ozone Day Classification using Random Forests with Oversampling and Statistical Tests
Author:
Abstract: Accurate warning of ozone concentration levels in the air is very important for public health. However, the characteristics of the public ozone level detection data in the UCI Machine Learning Repository make it difficult to build a warning system based on machine learning techniques. The data consist of 72 numerical attributes, a relatively large number, measured and collected over 7 years with some missing values, and the distribution of ozone days and normal days is highly imbalanced, which makes it difficult to create an accurate classification model. In this paper, PCA is applied first to address the high dimensionality, reducing the 72 attributes to 20 and yielding slightly better random forests, but the classification of ozone days remains poor because of insufficient data. To address the insufficient data for the minority class, which is 6.3% of the total, SMOTE, a representative oversampling method, is applied to the minority class repeatedly at very high rates. It is also checked whether better random forests can be obtained by oversampling every class at the same very high rate, generating much more synthetic data than original data and using it for training. In addition, to ensure the reliability of the synthetic data generated by SMOTE, a statistical test is performed on each attribute. The experimental results show that, with the suggested oversampling and statistical tests at relatively high oversampling rates, it is possible to generate synthetic data with statistical characteristics similar to those of the original data, and that random forests trained on this data achieve higher and more balanced classification accuracy than those trained on the original data alone, rising from 94% to 100%.
The contribution of this paper is a methodology for increasing the reliability of random forest models on very skewed, high-dimensional data such as the ozone day classification dataset.
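The oversampling and per-attribute check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `welch_t`, the neighbour count `k=5`, the toy two-attribute data, and the choice of Welch's variant of the t-test are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random
import statistics


def smote(minority, n_new, k=5, seed=0):
    """SMOTE-style oversampling (illustrative): each synthetic row lies on the
    segment between a minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` inside the minority class, excluding itself.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def welch_t(a, b):
    """Welch's t statistic comparing one attribute in two samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(var_a / len(a) + var_b / len(b))


# Toy minority class with two attributes (hypothetical data, not the UCI set).
data_rng = random.Random(1)
minority = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(30)]

# Oversample at a high rate: ten synthetic rows per original row.
synthetic = smote(minority, n_new=300)

# One t statistic per attribute: original values versus synthetic values.
t_values = [
    welch_t([row[j] for row in minority], [row[j] for row in synthetic])
    for j in range(2)
]
print(t_values)
```

A t statistic close to zero for an attribute suggests that the synthetic values have a mean similar to the original ones; in practice each statistic would be compared against the critical value for the chosen significance level before the synthetic data is used to train the random forests.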
Keywords: Ozone level detection, numerical attributes, high dimensionality of data, PCA, data skewness, oversampling, box plot, t-test, random forests
Pages: 863-882
DOI: 10.37394/232015.2024.20.81