Machine Learning Model for Offensive Speech Detection in Online
Social Networks Slang Content
FETHI FKIH1, TAREK MOULAHI2, ABDULATIF ALABDULATIF1
1Department of Computer Science, College of Computer, Qassim University, Buraydah 52571,
SAUDI ARABIA
2Department of Information Technology, College of Computer, Qassim University, Buraydah 52571,
SAUDI ARABIA
Abstract: - The majority of the world’s population (about 4 billion people) now uses social media such as
Facebook, Twitter, and Instagram. Social media has evolved into a vital form of communication,
allowing individuals to interact with each other and share their knowledge and experiences. On the other hand,
social media can be a source of malevolent conduct. In fact, abusive and criminal activity, such as cyberbullying
and threats, has grown increasingly common on social media, particularly among Arabic-speaking users.
Detecting such behavior, however, is difficult because it involves natural language, in particular
Arabic, which is grammatically and syntactically rich. Furthermore, social network users frequently
write in Arabic slang and disregard standard grammatical norms, making automatic recognition of bullying
harder still. Meanwhile, only a few research studies have addressed this issue for Arabic. The goal of this study is
to develop a method for detecting offensive speech written in Arabic slang in Online Social Networks
(OSNs). Because the absence of grammatical rules makes it difficult to set linguistic or semantic rules for
modeling Arabic slang, we propose a strategy that combines Artificial Intelligence with a statistical
approach. An experimental study comparing commonly used machine learning models shows that
Random Forest (RF) outperforms the others in terms of precision (90%), recall (90%), and F1-score (90%).
Key-Words: - Cyberbullying; offensive speech detection; Arabic social media; Classifications, Machine
Learning, Social Network, Arabic slang.
Received: April 9, 2022. Revised: November 12, 2022. Accepted: December 13, 2022. Published: January 17, 2023.
1 Introduction
In the last decade, the use of social media around
the world has grown exponentially. In fact, more than
half of the world’s population (about 4 billion
people) uses social media such as Facebook, Twitter,
and Instagram. These tools have become a very
important means of communication, allowing people to
connect with each other and exchange their knowledge
and experiences. Unfortunately, social media are
also a source of intellectual extremism,
cyberbullying, and abuse, which can take the form of
ordinary or death threats, racism, insults, bullying,
or any kind of terrorist act. The detection of social
media threats is a challenging problem for the
following reasons:
- There is an exponentially growing amount of
  information (tweets, posts, comments, instant
  chats, and blogs) on social media,
- There are few research studies related to
  the Arabic language in this field,
- Most Arabic-speaking users write in their local
  dialect instead of the standard Arabic language.
Consequently, in this research work, we
propose an efficient method based on a combination
of Artificial Intelligence techniques and statistical
features, since it is very difficult to set linguistic or
semantic rules for modeling Arabic slang, which
follows no clear grammatical rules. Relying on
statistical features is intended to keep the
performance of the system as high as possible. In the
next section, we highlight and discuss the most
relevant related work, in which AI
tools have been combined with many types of datasets
extracted from popular social networks
(YouTube, Facebook, Twitter, and Instagram) in
order to deal with social media threats.
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2023.20.2
Fethi Fkih, Tarek Moulahi,
Abdulatif Alabdulatif
E-ISSN: 2224-3402
7
Volume 20, 2023
Fig. 1: Most common languages used on the internet
1.1 Motivation
Fig. 1 outlines recent statistics, [1], on the
most common languages used on the internet,
showing that Arabic ranks fourth with 5.20%
of the total Internet content. This substantial share
is not matched by a comparable research effort to
analyze and study the language. In addition, some
Internet users prefer Arabic slang, which
derives from classical Arabic and borrows from other
natural languages. This fact complicates content
analysis and recognition.
On the other hand, the largest share of
Arabic internet content is found on social networks.
Although social networks help connect people by
letting them communicate, collaborate, and exchange
ideas, they are also becoming a source of
cyberbullying, offensive speech, and threats. These
problems are hard to track and control
manually. This makes automatic analysis of the
content both important and challenging, given the
previously mentioned characteristics of the Arabic
language.
1.2 Contribution
The key contributions of this work are:
1. To propose a purely statistical approach for
detecting hate speech and offensive content written
in Arabic slang on social networks, since it is very
difficult to set grammatical rules for it.
2. To prepare a dataset of Arabic slang
tweets and posts suitable for classification based
on the statistical approach defined previously.
3. To deploy a set of machine learning models,
namely Logistic Regression (LR), Decision Tree
(DT), k-Nearest Neighbors (k-NN),
Linear Discriminant Analysis (LDA), Multinomial
Naive Bayes (MNB), Gaussian Naive Bayes (GNB),
Support Vector Machines (SVM), Random Forest
(RF), and Neural Network (NN).
4. To compare the previously mentioned techniques
and identify the best-performing one for detecting
cyberbullying, hate speech, and offensive Arabic
slang content.
1.3 Paper Organization
The remainder of this paper is organized as follows:
Section 2 outlines the related works. In section 3 we
describe the proposed model. Section 4 presents the
used dataset and discusses the experimental results.
Finally, the conclusion is given in section 5.
2 Related works
Intellectual extremism detection is considered a
recent direction of research in the Computer Science
domain. In fact, the extraction of emotions,
opinions, and sentiment from textual content has
emerged with the rise of social networks. In the
following paragraphs, we provide a brief description
of the main approaches used for intellectual
extremism and cyberbullying detection with a focus
on those based on Text Analysis.
Huang et al., [2], claimed that textual features are
not enough for efficiently detecting intellectual
extremism and cyberbullying. For this reason, they
proposed to integrate structural social network
features in order to improve the accuracy of the
system. The proposed approach analyzes the
structure between the user and several structural
features such as the number of friends, network
embeddedness, and relationship centrality.
Nandhini and Sheeba, [3], provided an approach
based on fuzzy logic and a genetic algorithm in
order to recognize cyberbullying words in social
media. To train the classification algorithms, the
authors extracted two types of features: linguistic
(PoS) and numerical (frequency). The authors used
NLP tools for the text preprocessing phase and
the linguistic feature extraction phase.
In the same context, Nahar et al. proposed a
Machine Learning-based approach for detecting
abusive content on social networks, [4]. They used a
semi-supervised learning technique for decreasing
the number of training samples. For the
classification phase, the authors applied a fuzzy
SVM algorithm. As mentioned by the authors, this
technique is mainly designed for solving many
problems related to cyberbullying detection in real-
world situations like noisy and imbalanced data.
The work of Lee et al. applied Sentiment Analysis
techniques to messages and posts on Twitter, [5].
In practice, the authors proposed an auto-detection
model that used linguistic features, readability
(education level, age, and social status), sentiment
score, and information about the friendship network
of the target for predicting tweets containing
harassment or cyberbullying. For the classification
task, this approach applied three machine learning
algorithms: k-nearest neighbors, support vector
machine, and decision tree.
Alotaibi et al. introduced a Deep Learning-based
technique for detecting offensive and aggressive
behavior, [6]. For cleaning the text and extracting
linguistic features, the authors utilized Natural
Language Tools. Multichannel deep learning was
used for the classification phase, which consists of
three modules: bidirectional gated recurrent unit
(BiGRU), transformer block, and convolutional
neural network (CNN).
Akhter et al. proposed a Machine Learning-based
model for detecting cyberbullying on social media,
[7]. The model is trained on linguistic features
(PoS) extracted from the corpora using NLP tools
in order to classify textual messages into three
classes: Shaming, Sexual harassment, and Racism.
For the classification phase, the system used a
hybrid model that combined a Multinomial Naïve
Bayes classifier and fuzzy logic.
In a different approach, [8], the authors coupled
intelligence techniques with specific web
technology problems in order to combat
cyberbullying. This approach used text analysis and
data mining techniques for the classification of posts
on social media.
In the same context, Haidar et al. applied Machine
Learning algorithms (Naïve Bayes and SVM) and
NLP tools to the Arabic language, [9]. A similar
approach proposed by the same authors, [10], was
applied to the Arabic language and provided modest
results.
Mohaouchane et al. relied on a deep learning
approach, [11], to detect offensive language in the
content of Arabic social media. Motivated by the
negative effects of such content on users, the authors
try to automatically discover hate speech, demeaning
comments, and verbal attacks. They propose to use a
set of deep learning tools on a labeled YouTube
comments dataset. Although the accuracy results are
encouraging, there is a lack of comparison with
similar papers.
Omar et al. compared a set of machine learning
and deep learning techniques, [12], that
have been used to discover hate speech in Arabic
Online Social Networks (OSNs). The data was
collected from several of the most popular
social networks (Twitter, Instagram, YouTube, and
Facebook). The authors conducted an experimental
study using two deep learning architectures
and twelve machine learning algorithms. Based on
this study, they found that the Recurrent Neural
Network (RNN) performed better than the others,
reaching 98.7% accuracy.
Husain and Uzuner, [13], deal with the detection of
offensive language in Arabic OSN content. The
authors discuss and compare the main
techniques proposed for this serious problem. They
concentrate on contributions that combine
Natural Language Processing (NLP) with
machine learning models. After studying the state
of the art in this field, the authors conclude that
gaps and limitations remain to be addressed. Further
research effort is therefore needed to develop novel
benchmark resources and to investigate feature
extraction and pre-processing techniques further.
ALBayari et al. study the problem of cyberbullying,
[14]. They show that most previous studies
concentrate on the English language, and they
review the classification methods used
to discover cyberbullying in Arabic texts. They
found gaps related to the small number of studies in
that field, in addition to limitations linked to the
datasets themselves. Moreover, the majority of
the contributions proposed for automatically detecting
Arabic cyberbullying are based on Twitter, and most
of them use the SVM classifier or CNN.
Fig. 2: Existing gaps and the goals of our paper
This literature review leads us to summarize
the gaps and limitations and to link them to the
goals of our paper in Fig. 2.
3 Proposed Model
In this section, we give a general overview of
machine learning and the classification
techniques used. In addition, we outline how the
dataset was prepared.
3.1 Machine Learning Overview
Machine Learning (ML) is a subset of Artificial
Intelligence (AI) tools. AI is a simulation of human
intelligence. ML relies on probabilistic
mathematical formulas that let machines "learn" and
decide on an output after training on a dataset of
inputs.
The basic ML steps are (1) Data collection, (2) Data
preparation, (3) Model training, (4) Model
evaluation, and finally (5) Model tuning.
There are essentially 4 types of ML models:
- Supervised Learning Models: these work with a
  labeled dataset; examples are Neural Networks, SVM,
  Decision Trees, and Naïve Bayes, [15]. A
  supervised learning algorithm aims to model
  connections and dependencies between the input
  features and the target prediction output.
- Unsupervised Learning Models: these predict
  outputs without labels; examples are Principal
  Component Analysis (PCA), [16,17], for data
  reduction and K-means for clustering. These
  algorithms attempt to find patterns in the input
  data, aggregate and summarize the data
  points, and derive relevant insights that help
  users understand the data better.
- Semi-Supervised Learning Models: this is a
  hybrid of the previous two types,
  such as the Generative Adversarial Network (GAN).
  These techniques take advantage of the fact that,
  although the group memberships of the unlabeled
  data are unknown, this data still contains crucial
  details about the group parameters.
- Reinforcement Learning Models: these are very
  close to human learning, steering the learning
  process toward what makes the learner perform
  better; an example is Q-learning. This technique
  tries to take decisions that maximize the reward or
  minimize the risk, using observations
  acquired from interaction with the
  environment. The agent, a reinforcement
  learning algorithm, learns iteratively
  from its surroundings, [18].
3.2 Execution Process
In our proposed model, the execution process is
performed essentially in three phases, as shown in
Fig. 3:
Phase 1:
Preparing the dataset by using a statistical approach
to create features describing the list of tweets,
comments, and posts, and balancing the
distribution of classes to guarantee realistic behavior
and acceptable results.
Phase 2:
Using 9 relevant machine and deep learning tools
for training and testing on the previous dataset,
with the aim of classifying tweets, comments, and
posts.
Phase 3:
Comparing and discussing the results of the
techniques used previously. The comparison is based
on precision, recall, and F1-score. If the results are
not satisfactory, return to Phase 2.
Fig. 3: Phases of the proposed method
3.3 Features Preparation
In order to detect hate and offensive speech in
Arabic slang tweets, posts, and comments, we
propose Machine Learning methods using statistical
features, [19], [20]. This choice is motivated by
several reasons. First, it is very difficult to set
linguistic or semantic rules for modeling Arabic
slang, since it does not follow any grammatical rules.
Furthermore, slang is very local, varying even
within the same country. For these
reasons, we choose to use a purely statistical
approach.
As mentioned, our approach uses a set of
numerical features that are fed into
Machine Learning models. Table 1 presents the
statistical features used.
Table 1. Used features
Feature             Explanation
------------------  -----------------------------------------------------------
words_number        Number of words in the tweet
Char_number         Number of characters in the tweet
CRLF                CR and LF are control characters that are used to mark a
                    line break in the tweet
retweet_number      The number of retweets
Emoticons           An emoticon is a representation of a human facial
                    expression using only keyboard characters such as letters,
                    numbers, and punctuation marks
Emojis              An emoji is an image small enough to insert into text that
                    expresses an emotion or idea
question_mark       ?
interrogation_mark  !
dot_mark            Full stop (.)
Hashtag             The hashtag is used to highlight keywords or topics within
                    a tweet
URL                 Uniform Resource Locator
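The features of Table 1 can be computed with simple string operations. The following is a minimal sketch, not the extraction code used in this study: the regular expressions, the emoji code-point ranges, the choice to also count the Arabic question mark (؟), and the function name extract_features are all assumptions made for illustration, and each feature is assumed to be a per-tweet count.

```python
import re

# Hypothetical patterns: the paper does not specify how each feature is matched.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
EMOTICON_RE = re.compile(r"[:;=][-^']?[()DPpOo/\\|]")


def is_emoji(ch):
    """Rough emoji test covering common Unicode emoji blocks (an approximation)."""
    cp = ord(ch)
    return (0x1F300 <= cp <= 0x1FAFF) or (0x2600 <= cp <= 0x27BF)


def extract_features(text, retweet_number=0):
    """Map one tweet to the numerical features named in Table 1 (assumed to be counts)."""
    # Strip URLs first so that '://' and '.' inside links are not counted
    # as emoticons or full stops.
    no_urls = URL_RE.sub(" ", text)
    return {
        "words_number": len(text.split()),
        "Char_number": len(text),
        "CRLF": text.count("\n"),
        "retweet_number": retweet_number,  # metadata, supplied by the caller
        "Emoticons": len(EMOTICON_RE.findall(no_urls)),
        "Emojis": sum(is_emoji(ch) for ch in text),
        # Assumption: count both the Latin and the Arabic question mark.
        "question_mark": text.count("?") + text.count("؟"),
        "interrogation_mark": text.count("!"),
        "dot_mark": no_urls.count("."),
        "Hashtag": len(HASHTAG_RE.findall(text)),
        "URL": len(URL_RE.findall(text)),
    }
```

Because none of these features depends on vocabulary or grammar, the same function applies unchanged to any dialect, which is the property the statistical approach relies on.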
4 Experimental Study
This section provides an experimental study for
evaluating our proposed model. The
extracted statistical features are fed into a
set of 9 well-known Machine Learning models:
- Logistic Regression: despite its name, it is a
  classification model rather than a regression
  model, mainly used for binary and linear
  classification problems, [21].
- Decision Tree: it provides a classification and
  predictive model that can easily be presented
  graphically. This model can handle both
  numerical and categorical data, [22].
- k-Nearest Neighbors (KNN): this learning
  model stores all available data points (examples)
  and classifies new data points based on similarity
  measures, [23].
- Linear Discriminant Analysis (LDA): it is a very
  common technique for dimensionality reduction,
  used as a pre-processing step for machine
  learning and pattern classification applications,
  [24].
- Multinomial Naive Bayes: it predicts the tag of
  an observation, such as a word, a frequency, or a
  PoS, using the Bayes model. It calculates each
  tag’s likelihood for a given observation and
  returns the tag with the highest probability, [25].
- Gaussian Naive Bayes: Naive Bayes is a group of
  supervised machine learning classification
  algorithms based on Bayes' theorem. It is a
  simple technique for constructing classifiers:
  models that assign class labels to problem
  instances, [26].
- Support Vector Machine (SVM): it is a
  supervised machine learning model that can be
  used for classification and regression problems.
  However, the support vector machine is
  mathematically complex and computationally
  expensive, [27].
- Random Forest: it is a
  computationally efficient technique that can
  operate quickly over large datasets, [28].
- Neural Network: it is inspired by the
  sophisticated functionality of human brains,
  where hundreds of billions of interconnected
  neurons process information in parallel, [29].
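To make one of the listed models concrete, the sketch below implements the k-NN idea described above (store all examples, classify a new point by similarity) in plain Python. It is illustrative only: Euclidean distance, k = 3, the function name knn_predict, and the toy feature vectors in the usage note are assumptions, not settings reported in this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest stored examples."""
    # Euclidean distance from x to every training vector, smallest first.
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest examples and return the most frequent.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with the invented two-feature vectors [[10, 0], [12, 1], [40, 5], [42, 6]] labeled NOT_HS, NOT_HS, HS, HS, the query [41, 5] has two HS neighbours among its three nearest examples and is therefore labeled HS.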
4.1 Dataset Description
In this work, we use the Arabic slang dataset of the
OSACT2020 shared task. This dataset contains
6964 tweets, comments, and posts that are manually
annotated for two tasks: offensiveness (labels are
OFF or NOT_OFF) and hate speech (labels are HS
or NOT_HS). The class distribution of this dataset is
given in Table 2.
Table 2. Distribution of different classes in the
OSACT2020-dataset
Classes   Number of tweets  Ratio  Example
--------  ----------------  -----  ---------------------------------
OFF       1323              19%    [Arabic text not recoverable] !!
NOT_OFF   5641              81%    [Arabic text not recoverable]
HS        348               5%     [Arabic text not recoverable]
NOT_HS    6616              95%    [Arabic text not recoverable] 😍 🌹
As shown in Table 2, the distribution of classes in
the dataset is imbalanced. For instance, the HS class
accounts for only 5% of the dataset and is thus a
small minority. We handled this issue
using the SMOTE algorithm, [30], an
oversampling technique in which synthetic samples
are generated for the minority class. For this
purpose, we separated the two tasks (HS and OFF)
into two different feature files. Then, we applied the
SMOTE algorithm to each file, which provides the
following balanced distribution (Table 3):
Table 3. New class distribution after applying the
SMOTE algorithm
Classes   Number of records  Ratio  Total number
--------  -----------------  -----  ------------
OFF       1974               50%    3948
NOT_OFF   1974               50%
HS        2360               50%    4720
NOT_HS    2360               50%
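The core idea of SMOTE, [30], is to create each synthetic minority sample on the line segment between a real minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step under stated assumptions (Euclidean distance, k = 5, a fixed random seed, the function name smote); it is not the implementation used in this study, and it does not reproduce the record counts of Table 3.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating minority neighbours.

    minority must contain at least two feature vectors of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Since every synthetic point lies between two real minority points, oversampling densifies the minority region instead of duplicating records, which is what distinguishes SMOTE from plain random oversampling.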
4.2 Results
The classification results of the set of Machine
Learning models applied to the HS/NOT_HS task
are shown in Fig. 4 and Table 4. As can be seen, the
Random Forest and Decision Tree models outperform
all the other models, with F1-scores of 0.9 and
0.88, respectively. On the other hand, the Gaussian
Naive Bayes and SVM models provide the worst
results, with F1-scores of 0.61 and 0.69,
respectively.
In conclusion, a precision and a recall of 0.9 are
considered very good results given the huge
linguistic challenges faced when extracting
hate speech from documents in Arabic slang
using only statistical features.
Table 4. Comparison between ML models for
HS/NOT_HS class. The best results are marked with *
ML Models                     Precision  Recall  F1-score
----------------------------  ---------  ------  --------
Logistic Regression           0.79       0.79    0.79
Decision Tree                 0.88       0.88    0.88
KNN                           0.84       0.83    0.83
Linear Discriminant Analysis  0.78       0.78    0.78
Multinomial Naive Bayes       0.76       0.74    0.74
Gaussian Naive Bayes          0.71       0.64    0.61
SVM                           0.73       0.7     0.69
Random Forest                 0.9*       0.9*    0.9*
Neural Network                0.85       0.85    0.85
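For reference, the three metrics in Table 4 can be computed from predictions as sketched below. The averaging scheme is not stated in this paper. Because the F1-score of a single class is the harmonic mean of its precision and recall and so always lies between them, a row such as Gaussian Naive Bayes (precision 0.71, recall 0.64, F1-score 0.61) suggests that the reported values are averaged over both classes. The macro average shown here is therefore an assumption, as are the function names.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [precision_recall_f1(y_true, y_pred, c) for c in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))
```

With averaging over classes, the mean F1-score can fall below both the mean precision and the mean recall, which is consistent with the Gaussian Naive Bayes and SVM rows.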
Regarding the OFF/NOT_OFF task, the classification
results are very similar to those of the first task. As
shown in Fig. 5 and Table 5, the Random Forest and
Decision Tree models are still the best and outperform
all the others on the three evaluation metrics, with
F1-scores of 0.75 and 0.72, respectively. Likewise,
Gaussian Naive Bayes and SVM provide the worst
results for this task, with F1-scores of 0.53 and 0.61,
respectively.
Fig. 4: Evaluation of ML models on HS/NOT_HS
class
The main interpretation that can be drawn from
these results is that the performance of the proposed
model decreases when handling offensive speech
compared with hate speech. This can be explained by
the fact that offensive speech is more complex and
difficult to detect than hate speech and needs more
sophisticated techniques than simple statistical
features. Although statistical features are simple to
model and extract from textual content and make the
approach independent of any language, they are
unable to capture semantic relations and need
to be combined with semantic knowledge, such as
ontologies, [31], [32], [33].
Table 5. Comparison between ML models for
OFF/NOT_OFF class. The best results are marked with *
ML Models                     Precision  Recall  F1-score
----------------------------  ---------  ------  --------
Logistic Regression           0.71       0.71    0.71
Decision Tree                 0.72       0.72    0.72
KNN                           0.71       0.7     0.7
Linear Discriminant Analysis  0.7        0.7     0.7
Multinomial Naive Bayes       0.7        0.69    0.69
Gaussian Naive Bayes          0.59       0.56    0.53
SVM                           0.61       0.61    0.61
Random Forest                 0.75*      0.75*   0.75*
Neural Network                0.72       0.72    0.71
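The performance gap between the two tasks can be quantified directly from the F1-scores of Table 4 and Table 5. The snippet below is a small worked computation on those published values; the variable names are chosen for illustration.

```python
# F1-scores copied from Table 4 (HS/NOT_HS) and Table 5 (OFF/NOT_OFF).
f1_hs = {
    "Logistic Regression": 0.79, "Decision Tree": 0.88, "KNN": 0.83,
    "Linear Discriminant Analysis": 0.78, "Multinomial Naive Bayes": 0.74,
    "Gaussian Naive Bayes": 0.61, "SVM": 0.69, "Random Forest": 0.90,
    "Neural Network": 0.85,
}
f1_off = {
    "Logistic Regression": 0.71, "Decision Tree": 0.72, "KNN": 0.70,
    "Linear Discriminant Analysis": 0.70, "Multinomial Naive Bayes": 0.69,
    "Gaussian Naive Bayes": 0.53, "SVM": 0.61, "Random Forest": 0.75,
    "Neural Network": 0.71,
}

# Per-model loss in F1 when moving from hate speech to offensive speech.
drops = {m: round(f1_hs[m] - f1_off[m], 2) for m in f1_hs}
best_hs = max(f1_hs, key=f1_hs.get)
best_off = max(f1_off, key=f1_off.get)
mean_drop = round(sum(drops.values()) / len(drops), 3)
```

Every model loses between 0.05 (Multinomial Naive Bayes) and 0.16 (Decision Tree) F1 points on the offensive speech task, about 0.11 on average, while Random Forest remains the best model on both tasks.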
Fig. 5: Evaluation of ML models on OFF/NOT_OFF
class
5 Conclusion
The use of social media has become an everyday
habit. While it is a source of news, exchange of
ideas, and communication, it is also a source
of serious problems such as hate messages and
cyberbullying. Meanwhile, the Arabic language
ranks 4th among the most commonly used
languages in internet content. Nevertheless, few
research works address this problem in the
context of the Arabic language, and too few
deal with it in Arabic slang.
All of this motivated us to propose, in this paper, a
purely statistical approach for detecting
cyberbullying, hate speech, and offensive
tweets, comments, and posts. Our proposed method
provided good results on a prepared Arabic slang
dataset. In fact, the results show that our method
performed best when using
Random Forest and Decision Tree as classification
models.
In future work, we plan to improve the detection
results by working more on the dataset. This can be
done by integrating Natural Language
Processing (NLP) rules with the statistical approach
for data preparation.
Acknowledgement:
The researchers would like to thank the Deanship of
Scientific Research, Qassim University for funding
the publication of this project.
References:
[1] Statista, “Most common languages used on
the internet as of January 2020, by share of
internet users,” 2020. [Online]. Available:
https://www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet/
[2] Q. Huang, V. K. Singh, and P. K. Atrey,
“Cyber bullying detection using social and
textual analysis,” in Proceedings of the 3rd
International Workshop on Socially-aware
Multimedia, Orlando, Florida, USA, pp. 3-6,
2014.
[3] B. S. Nandhini and J. Sheeba, “Online social
network bullying detection using intelligence
techniques,” Procedia Computer Science, vol.
45, pp. 485-492, 2015.
[4] V. Nahar, S. Al-Maskari, X. Li, and C. Pang,
“Semi-supervised learning for cyberbullying
detection in social networks,” in Australasian
Database Conference, Brisbane, QLD,
Australia, pp. 160-171, Springer, 2014.
[5] P.-J. Lee, Y.-H. Hu, K. Chen, J. M. Tarn, and
L.-E. Cheng, “Cyberbullying detection on
social network services,” in PACIS 2018
Proceedings, Yokohama, Japan, vol. 61,
2018.
[6] M. Alotaibi, B. Alotaibi, and A. Razaque, “A
multichannel deep learning framework for
cyberbullying detection on social media,”
Electronics, vol. 10, no. 21, pp. 1-14, 2021.
[7] A. Akhter, U. K. Acharjee, and M. M. A.
Polash, “Cyber bullying detection and
classification using multinomial naïve bayes
and fuzzy logic,” Int. J. Math. Sci. Comput,
vol. 5, pp. 1-12, 2019.
[8] A. Ioannou, J. Blackburn, G. Stringhini, E. De
Cristofaro, N. Kourtellis, and M. Sirivianos,
“From risk factors to detection and
intervention: a practical proposal for future
work on cyberbullying,” Behaviour &
Information Technology, vol. 37, no. 3, pp.
258-266, 2018.
[9] B. Haidar, M. Chamoun, and A. Serhrouchni,
“A multilingual system for cyberbullying
detection: Arabic content detection using
machine learning,” Advances in Science,
Technology and Engineering Systems Journal,
vol. 2, no. 6, pp. 275-284, 2017.
[10] B. Haidar, M. Chamoun, and A. Serhrouchni,
“Multilingual cyberbullying detection system:
Detecting cyberbullying in arabic content,” in
2017 1st Cyber Security in Networking
Conference (CSNet), Rio de Janeiro, Brazil,
pp. 1-8, IEEE, 2017.
[11] H. Mohaouchane, A. Mourhir, and N. S.
Nikolov, “Detecting offensive language on
arabic social media using deep learning,” in
2019 Sixth International Conference on Social
Networks Analysis, management and security
(SNAMS), Granada, Spain, pp. 466-471,
IEEE, 2019.
[12] A. Omar, T. M. Mahmoud, and T. Abd-El-
Hafeez, “Comparative performance of
machine learning and deep learning
algorithms for arabic hate speech detection in
osns,” in The International Conference on
Artificial Intelligence and Computer Vision,
Cairo, Egypt, pp. 247-257, Springer, 2020.
[13] F. Husain and O. Uzuner, “A survey of
offensive language detection for the arabic
language,” ACM Transactions on Asian and
Low-Resource Language Information
Processing (TALLIP), vol. 20, no. 1, pp. 1-44,
2021.
[14] R. ALBayari, S. Abdullah, and S. A. Salloum,
“Cyberbullying classification methods for
arabic: A systematic review,” in The
International Conference on Artificial
Intelligence and Computer Vision, Settat,
Morocco, pp. 375-385, Springer, 2021.
[15] S. Zidi, T. Moulahi, and B. Alaya, “Fault
detection in wireless sensor networks through
svm classifier,” IEEE Sensors Journal, vol.
18, no. 1, pp. 340-347, 2017.
[16] T. Moulahi, “Joining formal concept analysis
to feature extraction for data pruning in cloud
of things,” The Computer Journal, pp. 1-9,
2021.
[17] T. Moulahi, S. El Khediri, R. U. Khan, and S.
Zidi, “A fog computing data reduce level to
enhance the cloud of things performance,”
International Journal of Communication
Systems, vol. 34, no. 9, pp. 1-13, 2021.
[18] A. Mchergui and T. Moulahi, “A novel deep
reinforcement learning based relay selection
for broadcasting in vehicular ad hoc
networks,” IEEE Access, vol. 10, pp. 112-121,
2021.
[19] F. Fkih and M. N. Omri, “Information
retrieval from unstructured web text document
based on automatic learning of the threshold,”
International Journal of Information Retrieval
Research (IJIRR), vol. 2, no. 4, pp. 12-30,
2012.
[20] F. Fkih and M. N. Omri, “Hidden data states-
based complex terminology extraction from
textual web data model,” Applied Intelligence,
vol. 50, no. 6, pp. 1813-1831, 2020.
[21] A. Subasi, Practical Machine Learning for
Data Analysis Using Python. Academic Press,
2020. [Online]. Available:
https://www.sciencedirect.com/book/9780128213797/practical-machine-learning-for-data-analysis-using-python
[22] V. Matzavela and E. Alepis, “Decision tree
learning through a predictive model for
student academic performance in intelligent
m-learning environments,” Computers and
Education: Artificial Intelligence, vol. 2, p.
100035, 2021.
[23] I. Saini, D. Singh, and A. Khosla, “Qrs
detection using k-nearest neighbor algorithm
(knn) and evaluation on standard ecg
databases,” Journal of Advanced Research,
vol. 4, no. 4, pp. 331-344, 2013.
[24] A. Tharwat, T. Gaber, A. Ibrahim, and A. E.
Hassanien, “Linear discriminant analysis: A
detailed tutorial,” AI Communications, vol.
30, no. 2, pp. 169-190, 2017.
[25] A. M. Kibriya, E. Frank, B. Pfahringer, and G.
Holmes, “Multinomial naive bayes for text
categorization revisited,” in Australasian
Joint Conference on Artificial Intelligence,
Canberra, ACT, Australia, pp. 488-499,
Springer, 2004.
[26] C. Bustamante, L. Garrido, and R. Soto,
“Comparing fuzzy naive bayes and gaussian
naive bayes for decision making in robocup
3d,” in Mexican International Conference on
Artificial Intelligence, Mexico City, Mexico,
pp. 237-247, Springer, 2006.
[27] S. Suthaharan, “Machine learning models and
algorithms for big data classification,” Integr.
Ser. Inf. Syst, vol. 36, pp. 1-12, 2016.
[28] T. M. Oshiro, P. S. Perez, and J. A.
Baranauskas, “How many trees in a random
forest?”, in International Workshop on
Machine Learning and Data Mining in
Pattern Recognition, Berlin, Germany, pp.
154-168, Springer, 2012.
[29] S.-C. Wang, “Artificial neural network,” in
Interdisciplinary Computing in Java
Programming, pp. 81-100, Springer, 2003.
[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and
W. P. Kegelmeyer, “Smote: synthetic
minority oversampling technique,” Journal of
Artificial Intelligence Research, vol. 16, pp.
321-357, 2002.
[31] F. Fkih and M. N. Omri, “Estimation of a
priori decision threshold for collocations
extraction: an empirical study,” International
Journal of Information Technology and Web
Engineering (IJITWE), vol. 8, no. 3, pp. 34-49,
2013.
[32] F. Fkih and M. N. Omri, “Hybridization of an
index based on concept lattice with a
terminology extraction model for semantic
information retrieval guided by wordnet,” in
International Conference on Hybrid
Intelligent Systems, Marrakech, Morocco, pp.
144-152, Springer, 2016.
[33] F. Fkih, M. N. Omri, and I. Toumia, “A
linguistic model for terminology extraction
based conditional random field,” in:
Proceedings of the International Conference
on Computer Related Knowledge, ICCRK
2012, Sousse, Tunisia, pp. 3-8, 2012.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
Fethi Fkih wrote the original draft of the paper and
carried out the simulation, the formal analysis, and
the optimization.
Tarek Moulahi defined the methodology and
reviewed and edited the paper.
Abdulatif AlAbdulatif was responsible for the
project administration and the funding acquisition.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
This work was funded by the Deanship of Scientific
Research, Qassim University, Saudi Arabia.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.