Comparative Analysis of Machine Learning for Predicting Air
Quality in Smart Cities
KAMEL MAALOUL, LEJDEL BRAHIM
Informatique
Department
El-Oued University
El Chott , El-Oued
ALGERIA
Abstract: - Ambient air pollution is the most harmful environmental risk to health. As urban air quality
improves, health costs from air pollution-related diseases diminish. This is why air pollution is a major
challenge for the public and government around the world. Deployment of the Internet of Things-based
sensors has considerably changed the dynamics of predicting air quality. Air pollution can be predicted
using machine learning algorithms Data-based sensors in the context of smart cities. In this paper, we
performed pollution forecasting using machine learning techniques while presenting a comparative
study to determine the best model to accurately predict air quality. Random Forest is an efficient
algorithm capable of detecting air quality.
Key-Words: - Air Quality; Machine Learning; Random Forest; Smart Cities; Linear Regression; IOT.
Received: September 12, 2021. Revised: May 7, 2022. Accepted: June 3, 2022. Published: July 2, 2022.
1 Introduction
The smart city exploits information and
communication technology (ICT) to optimize
populations’ lives in areas including such utilities,
transportation, energy, and, most importantly,
health. It also allows the government to make better
use of the resources it has [1].The most pressing
issues at the moment Smart cities have high levels of
air pollution. The potential to affect the features of
the atmosphere and increase health hazards to
organisms, as well as cause their destruction, is
characterized as air pollution [2]. Air pollution
impacts people’s health by exposing them to
pollutants in the environment. Air pollution must be
swiftly regulated in order to avoid a calamity.
Environmental quality forecasting must be done
quickly in order to take preventative measures in
smart cities [3]. The capacity to retrieve data from
sensors for smart cities is a significant benefit in
addressing the issues and roadblocks that impede
their development [4].
In the development of smart city concepts, air
quality is a critical component. Air quality, on
the other hand, is a big issue in many locations. The
high population growth led to an increase in the use
of transportation, energy, industry and vehicles.
According to many studies, air quality in smart cities
represents a major challenge for smart cities and that
machine learning is an effective and better solution
to predict air quality, although it is not used in many
cities of the world. Different algorithms and
techniques machine learning have been utilized in
the past for various use cases, including Neural
Networks, fuzzy a system, Support Vector Machine,
Support Vector Machine for regression, fuzzy logic,
Decision Trees, and K-Nearest Neighbor.
Sophisticated and advanced machine learning
techniques are increasingly being used, with 2.5
micrometers particulate matter (PM2.5) being the
main target element; air quality data is best
combined with other types of data. It is critical to
guarantee that analysis of data is accurate and
efficient. Data misinterpretation can lead to
erroneous conclusions, which can be extremely
dangerous. To guarantee the velocity and analysis of
communication sources, artificial intelligence
predictions and the usage of reliable data must be
carefully considered [1].
We used multiple machine learning approaches
to predict pollution in this work. In order to
determine the best and most accurate predictive
model for pollution estimate, we give a
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
248
Volume 21, 2022
comparative analysis of these methodologies based
on acknowledged evaluation criteria. We
investigated the processing time of various
technologies using independent learning and
Apache Spark, given the criteria for effective real-
time analysis of smart city data. We proposed the
best model in terms of both processing time and
error rate in this research. The purpose of this
research is to use a variety of machine learning
models to provide an accurate and comparative
examination of air quality. The results of data
collection utilizing sensors at a certain site are
reviewed and compared to identify the highest
performing algorithm and accuracy.
The remainder of the paper is structured as
follows: The relevant work is briefly discussed in
Section 2 and how it differs from our submitted
study. The third section discusses machine learning
algorithms and how to use the model. The
proposed model, empirical evaluation method, and
data gathering process are all explained in Section
4. Section 5 contains the findings and their
commentary. Finally, Section 6 summarizes the
paper’s findings as well as future study directions.
2 Related Works
In this section, we mention some previous studies
related to our topic.
Martinez-Espana et al. [5] use of Machine
Learning algorithms to forecast particulate matter
concentration in the atmospheric air has been
discussed. Machine learning technologies such as
linear regression, ML Pregressor neighbors
regressor, Decision Tree regressor, and gradient
boosting regressor were used in the experiments.
The authors convincingly demonstrated that
Gradient Boosting Regression outperformed all
other techniques.
Hamami et al. [6] proposes air quality category
using class algorithms together with Logistic
Regression, KKN, decision Tree, and Random forest
set of rules. primarily based on experiment, decision
tree model has the first-rate accuracy to categories
air excellent degree as much as 100% with tuning
numerous hyper parameters.
Fernando et al. [7] have a look at has been to
find the maximum appropriate system studying
method for predicting accurate air quality index in
Colombo based upon PM2.five unique
awareness.PM2.five concentration in Colombo
were anticipated the usage of four correlated air
pollutant concentrations including SO2, NO2,
PM2.5, PM10. machine mastering techniques
consisting of k-Nearest Neighboring, more than
one Linear-Regression, Random forest, and support
Vector Machines have been used to educate and
examine the prediction models. Random forest
changed into identified because the excellent
appropriate prediction model after evaluating the
models, with over 85% extra accuracy.
Abirami et al. [8] predict AQI accurately with
utilizing the information set on one of a kind ML
version with proper pre-processing method for
locating nature regarding the air rests for the
maximum component signified with the aid of its
AQI (air high-quality index) value. It is tried to
gauge the air situation inside the bounds of a
definite region by using utilizing device learning
strategies like aid vector regression (SVR),
selection tree regression (DTR), a couple of linear
regression (MLR) and random forest regression
(RFR). RFR performed out the finest among all
regression examples.
Murugan et al. [9] implement machine learning
algorithms to discover the accuracy of the
prediction of particulate be counted, PM2.five
in air pollutants in smart towns . to test the
implementation of device studying on this
prediction, Multi-Layer Perceptron (MLP), and
Random forest are selected and in comparison
among those algorithms the usage of the Air
pollution dataset. The outcome of this studies is
that Random forest area gave the first-rate accuracy
in prediction of Particulate count number.
On this paper, Sinnott et al. [10] explored using
machine learning algorithms for prediction of
pollution and specifically PM2.five. Prediction of
pollutants activities is more and more essential in
primary cities due to the increased urbanization of
populations and associated effect on site visitors
volumes. statistics from a selection of
heterogeneous assets became used and concerned
collection and cleaning for use in machine learning
algorithms. Linear regression models, ANN models
and LSTMs had been all explored. It turned into
found that LSTM completed best and became able
to are expecting high PM2.5 values with affordable
accuracy. ANN and linear models have drawbacks
in prediction of high PM2.five values but they
provided reasonable standard performance.
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
249
Volume 21, 2022
In this paper, Pasupuleti et al. [11] are
introducing a device that could continue that could
take gift pollutants and with the help of beyond
pollutants, we are walking an algorithm based
totally at the machine learning to are expecting the
future data of pollutants. The sensed information is
stored within the Excel sheet for in addition
evaluation. those sensors are used on the Arduino
platform to accumulate the pollutant data. We are
expecting the air high-quality index by means of
the use of specific machine learning algorithms like
linear regression, choice Tree and Random forest.
From the consequences, we concluded that the
Random forest set of rules gives better prediction
of air satisfactory index.
Vu et al. [12] applied a machine learning-based
random forest technique to determine the
effectiveness of the Beijing Action Plan by
separating the impact of meteorology on
ambient air quality. The effects display that
meteorological conditions have an important have
an effect on at the annual adjustments in ambient
air satisfactory. The movement plan has
additionally been very effective in lowering
primary pollutants emissions and enhancing
Beijing’s air pleasant. This work is a a hit instance
of air excellent policy development in other areas
of China and other developing international
locations.
3 Proposed Architecture
Figure 1 describes the architecture of the proposed
model. Data on air pollution is collected via
sensors and saved as a dataset. This data set has
already been preprocessed. The data set is then
separated into two parts: a training data set and a
test data set. The training dataset is subjected to
more supervised machine learning methods. The
collected results are then compared to the test data
set and assessed.
Figure 1: The architecture of the
implemented System
4 Background Machine Learning
Algorithms
The Supervised Machine Learning approach to air
pollution prediction takes into account machine
learning algorithms such as:
4.1 Linear Regression
When continuous variables are utilized to predict
actual values, linear regression is used. Linear
Regression is a machine learning algorithm that
performs a regression task and is based on
supervised learning. Linear regression delivers a
target prediction value based on independent
variables, which is most frequently employed for
discovering the connection between variables and
forecasting [13]. In the Figure 2 above, on X-axis
is the unbiased variable and on Y-axis is the output.
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
250
Volume 21, 2022
The regression line is the exceptional in shape line
for a model. And our major objective on this
algorithm is to locate this best in shape line.
Figure 2: Graph of Linear regression
Linear Regression may also lead to
over-becoming but it may be averted the
use of some dimensionality discount
strategies, regularization strategies, and
go-validation. It over-simplifies real-
international issues via assuming a linear
courting a few of the variables, therefore
now not encouraged for realistic use-
cases(Figure 2).
4.2 Decision Tree
Decision tree is a supervised learning algorithm that
is used to describe a choice made in response to a
circumstance. It can be used for classification as
well as regression. The decision tree is always built
from the top down. The Decision Tree Regression is
a non linear and non-continuous entity. It’s a
decision-making function that accepts an attribute
values vector as input and returns a decision. It can
be used to address problems involving regression as
well as classification. A decision tree makes a
decision by performing a set of operations [14]. It
can work with numerical and categorical features
and easy to understand and interpret, visually
intuitive.
Figure 3: Decision Tree Classification
Algorithm.
There are two nodes in every decision tree. One is
decision node and the opposite is leaf node.
selection nodes have many branches and leaf nodes
are the resultant / final results of the decisions that
were taken to attain to its point. these choices are
made based on a given dataset(Figure 3).
4.3 Random forest
Random forest is a collection of decision trees
used for regression and classification. To determine
the majority vote, classification is used. The mean
value is calculated using regression. This method is
more accurate, resilient, and capable of handling a
wide range of data types, including binary,
category, and continuous data. Random Forest is
nothing more than a collection of decision trees. A
random forest is a meta-estimator (i.e., it integrates
the results of numerous forecasts) that aggregates
many decision trees and improves them. The
number of functions that can be split on each node
is limited to a percentage of the total (known as the
hyper parameter) [15]. Random Forest works in
two-phase first is to create the random forest by
combining N decision tree, and second is to make
predictions for each tree created in the first
phase(Figure 4).
Figure 4: Random Forest Algorithm
.
4.4 Support Vector Machines (SVM)
Support vector machines (SVMs, also known as
support vector networks) are supervised learning
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
251
Volume 21, 2022
models that evaluate data and recognize patterns and
are used for classification and regression analysis in
machine learning. Support vector machines (SVMs)
are supervised learning models with associated
learning algorithms for classification and regression
analysis in machine learning. A non-probabilistic
binary linear classifier, an SVM training technique
builds a model that assigns new data measurements
to one of two categories [16].
The working of the SVM algorithm can be
understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the
dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of
coordinates in either green or blue(Figure 5).
Figure 5: Support Vector Machine Algorithm
4.5 Gradient boosting
Gradient boosting is a set of machine learning
algorithms that merge a variety of weak learning
models to generate a strong predictive model. When
doing gradient boosting, decision trees are
commonly employed. Gradient boosting models are
getting popular as a measure of the ability to classify
complex datasets [17]. It is a regression and
classification machine learning technique that
generates a prediction model in the form of an
ensemble of weak prediction models, often decision
trees.
A Gradient Boosting Machine or GBM
combines the predictions from multiple decision
trees to generate the final predictions. Keep in
mind that all the weak learners in a gradient
boosting machine are decision trees(Figure 6).
Figure 6: Gradient boosting Algorithm.
5 Methodology
Air pollution is a phenomenon, which is affected
by different factors. In order for precise prediction,
right identification of these parameters influencing
the air pollution is necessary. We evaluate the high
level aspects of the data more generally before
starting a machine learning examination of the
data. PM2.5 levels are also more concentrated on
weekdays, as can be observed. As a result, it is
clear that PM2.5 is influenced by human activity,
and that it is especially linked to traffic levels.
5.1 Dataset
The pollutant data: This information was acquired
in order to train the algorithm to detect air quality.
CarbonMonoxide(CO), Sulphurdi oxide(SO2), and
Ozone (O3) were to be included in the data set
for 74 cities in China daily data from Oct. 2013
to April.2015 are used for air quality index
prediction., are obtained from link
http://106.37.208.233:20035/.
Meteorological data: Temperature, Wind Speed,
Humidity, and Wind Direction are the
meteorological data information set parameters that
are utilized to train the system for 74 cities.
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
252
Volume 21, 2022
5.2 Data Preprocessing
Data preprocessing in machine learning refers to the
technique of preparing (cleaning and organizing)
raw data to make it fit to build and train machine
learning models. Often, data preprocessing is the
most important phase of a machine learning project.
Data preprocessing is a step in the data mining and
data analysis process that takes raw data and
transforms it into a format that can be understood
and analyzed by computers and machine learning.
6 Experiments
6.1 Results
The results obtained from the test were analyzed and
as compared to decide the exceptional performing
set of rules .The pc used for the test was an Intel i3-
9100 with 16GB RAM and LGA 1151 card. We
trained 5 models primarily based on decision Tree,
Linear Regression, Random forest, SVM, and GBC.
After education the 5 ML models, we provided them
with trying out facts set for prediction.
We carried out the distinctive machine learning
algorithms in Python the use of Jupyter notebook.
the following plot indicates that all the capabilities
that are considered for the prediction are correlated
and as a consequence can be taken into
consideration to teach the model.
For the CO prediction, we get prediction
accuracy as follows
Model Name
Prediction
Linear Regression
0.06%
Decision Tree
0.69%
Random forest
0.91%
Support Vector Machine
0.80%
Gradient Boosting
0.36%
Table 1: Prediction Probability of CO
From the previous table 1, it is clear that the
Random forest algorithm is the best for others. To
further clarify the results, we represented them with
the following graph (Figure 7), which shows the
extent of variation between the various applied
machine learning algorithms:
Figure 7: CO prediction probability.
For the SO2 prediction, we get prediction
accuracy as follows
Model Name
Linear Regression
Decision Tree
Random forest
Support Vector Machine
Gradient Boosting
Table 2: Prediction Probability of SO2
Figure 8: SO2 prediction probability.
From the previous table 2, it is clear that the
Random forest algorithm is the best for others. To
further clarify the results, we represented them with
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
253
Volume 21, 2022
the following graph(Figure 8), which shows the
extent of variation between the various applied
machine learning algorithms.
For the O3 prediction, we get prediction
accuracy as follow
Model Name
Prediction
Linear Regression
0.21%
Decision Tree
0.55%
Random forest
0.87%
Support Vector Machine
0.63%
Gradient Boosting
0.42%
Table 3: Prediction Probability of O3
From the previous table 3, it is clear that the
Random forest algorithm is the best for others. To
further clarify the results, we represented them with
the following graph(Figure 9), which shows the
extent of variation between the various applied
machine learning algorithms:
Figure 9: O3 prediction probability.
6.2 Discussion
From the previous tables, it can be seen that the:
Linear Regression algorithm has much
weaker indicators compared to the other
models. This is due to the presence of
several outliers and outliers from other data
points.
Decision Tree algorithm is a class capable of
performing multi-class classification on a
data set. They perform well even with the
model from which the data is generated. But
they are very complex and do not generalize
well.
The Support Vector Machine algorithm has
good results. It is more productive in high
dimensional spaces and when there is a
margin of separation between categories as
in our case.
The Gradient Boosting algorithm has
average results. The training process takes
longer due to the fact that trees are built
sequentially. It does not work well with
sparse data.
Random Forest has the advantage of being
the easiest to understand and use of the five
machine learning algorithms tested. Random
Forest also outperforms other strategies in
terms of processing time. However, it
should be noted that the performance does
not compare favorably to that of other
models. Random Forest is a multi-tree
ensemble approach. By integrating several
trees, it lowers the over fitting of single
trees. The peak values were identified using
this model. Furthermore, the processing time
was shorter than that of other models.
For the dataset, Random Forest algorithm performs
the best among all five algorithms. We have
validated our approach with field trials and have
shown the performance comparison against different
algorithms.
Through graphs, Random Forest is an efficient
algorithm capable of detecting air quality. As shown
in Figures 7,8 and 9, Random Forest algorithm made
more accurate predictions than other algorithms.
7 Conclusion
We use machine learning algorithms such as linear
regression, Decision Tree, Support Vector Machine
,Random Forest (RF) and Gradient Boosting to
forecast the air quality index. We determined that
the RF algorithm provides a better prediction of the
air quality index based on the data. RF train each
tree independently, using a random sample of the
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
254
Volume 21, 2022
data. This randomness helps to make the model
more robust than a single decision tree, and less
likely to adapt on the training data. There are
typically two parameters in RF number of trees and
number of features to be selected at each node.
We plan to examine the performance of these
strategies in a multi-core context in the future. We
also plan to look at the other elements that contribute
to air pollution. The use of supervised learning is
also being expanded, as is unsupervised learning. In
the future, projected architecture can be integrated
into a real-time environment once pollution
measurement equipment is installed in smart cities.
References:
[1] Ameer, Saba and Shah, Munam Ali and Khan,
Abid and Song, Houbing and Maple, Carsten and
Islam, Saif Ul and Asghar, Muhammad Nabeel.
Comparative analysis of machine learning
techniques for predicting air quality in smart
cities.
IEEE Access,
2019, No. 7, pp.128325–
128338.
[2]
J. Sentian, F. Herman, C. Y. Yin and J. C. H.
Wui
.
Long-term air pollution trend analysis in
Malaysia
.
International Journal of
Environmental Impacts, 2019, No.2, pp.309–324.
[3]
Lu, Weizhen and Wang, Wenjian and Leung,
Andrew YT and Lo, Siu-Ming and Yuen, Richard
KK and Xu, Zongben and Fan, Huiyuan. Air
pollutant parameter forecasting using support
vector machines.
In
Proceedings of the 2002
International Joint Conference on Neural
Networks. IJCNN’02 (Cat. No.
02CH37290
);
Publishing: IEEE
,
2002; pp. 630
635.
[4] Iskandaryan, Ditsuhi and Ramos, Francisco and
Trilles, Sergio.Air quality prediction in smart
cities using machine learning technologies based
on sensor data: a review.
Ap
plied Sciences,
2020,
No.10,
pp. 2401.
[5] Mart´ınez-Espan˜a, Raquel and Bueno-Crespo,
Andres and Timon-Perez, Isabel Maria and
Soto, Jesu´s A and Ortega, Andr´es Mun˜oz
and Cecilia, Jose M.Air-Pollution Prediction
in Smart Cities through Machine Learning
Methods: A Case of Study in Murcia, Spain.
J.
Univers. Comput. Sci
., 2018, No.24, pp.261–276 .
[6] Hamami, Faqih and Dahlan, Iqbal Ahmad. Air
Quality Classification in Urban Environment
using Ma- chine Learning Approach.
In IOP
Conference Series: Earth and Environmental
Science
;
Publishing: IOP Publishing, 2022; pp.
012004.
[7] Fernando, RM and Ilmini, WMKS and
Vidanagama, DU.Prediction of Air Quality
Index in Colombo.2022.
[8] Abirami, G and Girija, R and Das, Anindya and
Sreenivasan, Navneeth. Predicting Air Quality
Index with Machine Learning Models. In
Machine Learning and Deep Learning in
Efficacy Improvement of Healthcare
Systems;Publishing: CRC Press, 2022; pp. 353–
371.
[9] Murugan, Rishanti and Palanichamy, Naveen.
Smart City Air Quality Prediction using
Machine Learn- ing. In
2021 5th International
Conference on Intelligent Computing and
Control Systems (ICI- CCS);Publishing: IEEE,
2021; pp. 1048–1054.
[10] Sinnott, Richard O and Guan, Ziyue. Prediction
of air pollution through machine learning
approaches on the cloud. In
2018 IEEE/ACM
5th International Conference on Big Data
Computing Applications and Technologies
(BDCAT);Publishing: IEEE, 2018; pp. 51–60.
[11] Pasupuleti, Venkat Rao and Kalyan, Pavan and
Reddy, Hari Kiran and others. Air quality
prediction of data log by machine learning. In
2018 2020 6th International Conference on
Advanced Computing and Communication
Systems (ICACCS);Publishing: IEEE, 2020; pp.
1395–1399.
[12] Freedman, David A.Statistical models: theory
and practice. 2009, cambridge university press .
[13] Maimon, Oded Z and Rokach, Lior. Data
mining with decision trees: theory and
applications.
Publishing: World scientific, 2014.
[14] Breiman, Leo. Bagging predictors.
Springer
1996, 24, 123–140 .
[15] Zhao, Zhongliang and Carrera, Jose and
Niklaus, Joel and Braun, Torsten. Machine
Learning-Based Real-Time Indoor Landmark
Localization. In
International Conference on
Wired/Wireless Internet
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
255
Volume 21, 2022
Communication;Publishing : Elsevier, 2018; pp.
95–106.
[16] Zhang, Yanru and Haghani, Ali. A gradient
boosting method to improve travel time
prediction. Trans- portation Research Part C:
Emerging Technologies,
2015, 58, 308–324.
[17] Vu, Tuan V and Shi, Zongbo and Cheng, Jing
and Zhang, Qiang and He, Kebin and Wang,
Shuxiao and Harrison, Roy M. Assessing the
impact of clean air action on air quality trends
in Beijing using a machine learning technique.
Copernicus GmbH,
2019, 19, 11303–11314 .
Contribution of individual
authors to the creation of a
scientific article (ghostwriting
policy)
Kamel, Brahim carried out the simulation
and the optimization.
Kamel has implemented the Algorithms .
Kamel has organized and executed the
experiments of Section 6.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.30
Kamel Maaloul, Lejdel Brahim
E-ISSN: 2224-2872
256
Volume 21, 2022