Machine Learning Model for Offensive Speech Detection in Online Social Networks Slang Content

FETHI FKIH1, TAREK MOULAHI2, ABDULATIF ALABDULATIF1

1Department of Computer Science, College of Computer, Qassim University, Buraydah 52571,

SAUDI ARABIA

2Department of Information Technology, College of Computer, Qassim University, Buraydah 52571,

SAUDI ARABIA

Abstract: - The majority of the world’s population (about 4 billion people) now uses social media such as Facebook, Twitter, and Instagram. Social media has evolved into a vital form of communication, allowing individuals to interact with each other and share their knowledge and experiences. On the other hand, social media can be a source of malevolent conduct. Abusive and criminal behavior, such as cyberbullying and threats, has grown increasingly common on social media, particularly among Arabic-speaking users. Detecting such behavior is difficult because it involves natural language, and Arabic in particular is grammatically and syntactically rich. Furthermore, social network users frequently write in Arabic slang and ignore standard grammatical norms, which makes automatic recognition of bullying harder. Meanwhile, only a few studies have addressed this issue for Arabic. The goal of this study is to develop a method for detecting offensive speech written in Arabic slang in Online Social Networks (OSNs). Because Arabic slang lacks grammatical rules, linguistic or semantic rules are difficult to define for it; we therefore propose a strategy that combines Artificial Intelligence with a statistical approach. An experimental study comparing common machine learning models shows that Random Forest (RF) outperforms the others in terms of precision (90%), recall (90%), and F1-score (90%).

Key-Words: - Cyberbullying; offensive speech detection; Arabic social media; Classifications, Machine

Learning, Social Network, Arabic slang.

Received: April 9, 2022. Revised: November 12, 2022. Accepted: December 13, 2022. Published: January 17, 2023.

1 Introduction

In the last decade, the use of social media has grown exponentially worldwide. More than half of the world’s population (about 4 billion people) uses social media such as Facebook, Twitter, and Instagram. These tools have become an important means of communication, allowing people to connect with each other and exchange knowledge and experiences. Unfortunately, social media are also a source of intellectual extremism, cyberbullying, and abuse, which can take the form of ordinary threats or death threats, racism, insults, bullying, or any kind of terrorist act. Detecting social media threats is a challenging problem for the following reasons:

• The volume of information on social media (tweets, posts, comments, instant chats, and blogs) is growing exponentially,
• Few research studies in this field address the Arabic language,
• Most Arabic-speaking users write in their local dialect instead of standard Arabic.

Consequently, in this research work, we propose an efficient method based on a combination of Artificial Intelligence techniques and statistical features, since it is very difficult to set linguistic or semantic rules for modeling Arabic slang, which has no clear grammatical rules. The proposed approach relies on statistical features with the aim of achieving good system performance. In the next section, we highlight and discuss the most relevant work in this context, where AI tools have been combined with many types of datasets extracted from social networks (YouTube, Facebook, Twitter, and Instagram) in order to deal with social media threats.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2023.20.2

Fethi Fkih, Tarek Moulahi,

Abdulatif Alabdulatif

E-ISSN: 2224-3402

Volume 20, 2023

Fig. 1: Most common languages used on the internet

1.1 Motivation

Fig. 1 outlines recent statistics, [1], on the most common languages used on the internet, showing that Arabic ranks fourth with 5.20% of the total Internet content. This significant share has not been matched by a comparable research effort to analyze and study the language. In addition, some Internet users prefer Arabic slang, which derives from classical Arabic and from other natural languages. This fact complicates content analysis and recognition.

On the other hand, most Arabic internet content is found on social networks. Although social networks help people connect by communicating, collaborating, and exchanging ideas, they are also becoming a source of cyberbullying, offensive speech, and threats. These problems are hard to track and control manually. This makes automatic content analysis both important and challenging, given the previously mentioned characteristics of Arabic.

1.2 Contribution

The key contributions of this work are:

1. To propose a purely statistical approach for detecting hate speech and offensive content written in Arabic slang on social networks, since it is very difficult to set grammatical rules for it.

2. To prepare a dataset of Arabic slang tweets and posts suitable for classification based on the statistical approach defined previously.

3. To deploy a set of machine learning approaches: Logistic Regression (LR), Decision Tree (DT), k-Nearest Neighbors (k-NN), Linear Discriminant Analysis (LDA), Multinomial Naive Bayes (MNB), Gaussian Naive Bayes (GNB), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN).

4. To compare the previously mentioned techniques and identify the best-performing one for detecting cyberbullying, hate speech, and offensive Arabic slang content.

1.3 Paper Organization

The remainder of this paper is organized as follows:

Section 2 outlines the related works. In section 3 we

describe the proposed model. Section 4 presents the

used dataset and discusses the experimental results.

Finally, the conclusion is given in section 5.

2 Related works

Intellectual extremism detection is considered a

recent direction of research in the Computer Science

domain. In fact, the extraction of emotions,

opinions, and sentiment from textual content has

emerged with the rise of the social network. In the

following paragraphs, we provide a brief description

of the main approaches used for intellectual

extremism and cyberbullying detection with a focus

on those based on Text Analysis.

Huang et al., [2], claimed that textual features are

not enough for efficiently detecting intellectual

extremism and cyberbullying. For this reason, they

proposed to integrate structural social network

features in order to improve the accuracy of the

system. The proposed approach analyzes the

structure between the user and several structural

features such as the number of friends, network

embeddedness, and relationship centrality.

Nandhini and Sheeba, [3], provided an approach based on fuzzy logic and a genetic algorithm to recognize cyberbullying words in social media. To train the classification algorithms, the authors extracted two types of features: linguistic (PoS) and numerical (frequency). They used NLP tools for the text preprocessing and linguistic feature extraction phases.

In the same context, Nahar et al. proposed a

Machine Learning-based approach for detecting

abusive content on social networks, [4]. They used a

semi-supervised learning technique for decreasing

the number of training samples. For the

classification phase, the authors applied a fuzzy

SVM algorithm. As mentioned by the authors, this

technique is mainly designed for solving many

problems related to cyberbullying detection in real-

world situations like noisy and imbalanced data.

The work of Lee et al. applied Sentiment Analysis techniques to messages and posts on Twitter, [5]. The authors proposed an auto-detection model that used linguistic features, readability (education level, age, and social status), sentiment score, and information about the friendship network of the target to predict tweets containing harassment or cyberbullying. For the classification task, this approach applied three machine learning algorithms: k-nearest neighbors, support vector machine, and decision tree.

Alotaibi et al. introduced a Deep Learning-based technique for detecting offensive and aggressive behavior, [6]. For cleaning the text and extracting linguistic features, the authors utilized natural language tools. A multichannel deep learning model was used for the classification phase; it consists of three modules: a bidirectional gated recurrent unit (BiGRU), a transformer block, and a convolutional neural network (CNN).

Akhter et al. proposed a Machine Learning-based model for detecting cyberbullying on social media, [7]. The model is trained on linguistic features (PoS) extracted from the corpora using NLP tools in order to classify textual messages into three classes: shaming, sexual harassment, and racism. For the classification phase, the system used a hybrid model that combined a Multinomial Naïve Bayes classifier and fuzzy logic.

In a different approach, [8], the authors coupled

intelligence techniques with specific web

technology problems in order to combat

cyberbullying. This approach used text analysis and

data mining techniques for the classification of posts

on social media.

In the same context, Haidar et al. applied Machine

Learning algorithms (Naïve Bayes and SVM) and

NLP tools to the Arabic language, [9]. A similar

approach proposed by the same authors, [10], was

applied to the Arabic language and provided modest

results.

Mohaouchane et al. relied on a deep learning approach, [11], to detect offensive language in Arabic social media content. Motivated by the negative effects of such content on users, the authors try to automatically discover hate speech, demeaning comments, and verbal attacks. They propose to use a set of deep learning models on a labeled dataset of YouTube comments. Although the accuracy results are encouraging, a comparison with similar work is lacking.

Omar et al. compared a set of machine learning and deep learning techniques, [12], used to discover hate speech in Arabic Online Social Networks (OSNs). The data were collected from several of the most popular social networks (Twitter, Instagram, YouTube, and Facebook). The authors conducted an experimental study using two deep learning architectures and twelve machine learning algorithms. Based on this study, they found that a Recurrent Neural Network (RNN) performed better than the others, reaching 98.7% accuracy.

Husain and Uzuner, [13], deal with the detection of offensive language in Arabic OSN content. The authors discuss and compare the main techniques proposed for this serious problem, concentrating on contributions that combine Natural Language Processing (NLP) with machine learning models. After studying the state of the art in this field, the authors conclude that gaps and limitations remain to be addressed. Further research effort is therefore needed to develop novel benchmark resources and to investigate feature extraction and pre-processing techniques in more depth.

ALBayari et al. study the problem of cyberbullying, [14]. They show that most previous studies concentrate on the English language, and they review the classification methods used to discover cyberbullying in Arabic texts. They found gaps related to the small number of studies in that field, in addition to limitations linked to the datasets themselves. Moreover, the majority of proposed contributions for automatically detecting Arabic cyberbullying are based on Twitter, and most of them use an SVM or CNN classifier.

Fig. 2: Existing gaps and the goals of our paper

This literature review leads us to summarize the gaps and limitations and link them to the goals of our paper in Fig. 2.

3 Proposed Model

In this section, we give a general overview of machine learning and of the classification techniques used. In addition, we outline how the dataset was prepared.

3.1 Machine Learning Overview

Machine Learning (ML) is a subset of Artificial Intelligence (AI). AI aims to simulate human intelligence, while ML relies on probabilistic mathematical models that allow machines to "learn" from a dataset of inputs and then decide the output.

The basic ML steps are: (1) data collection, (2) data preparation, (3) model training, (4) model evaluation, and finally (5) model tuning.

There are essentially four types of ML models:

• Supervised Learning Models: these work with a labeled dataset; examples include Neural Networks, SVM, Decision Trees, and Naïve Bayes, [15]. A supervised learning algorithm aims to model connections and dependencies between the input features and the target prediction output.
• Unsupervised Learning Models: these work with unlabeled data; examples include Principal Component Analysis (PCA), [16,17], for data reduction and K-means for clustering. These algorithms attempt to find patterns in the input data, aggregate and summarize the data points, and derive relevant insights that help users understand the data better.
• Semi-Supervised Learning Models: this is a hybrid of the previous two types, such as the Generative Adversarial Network (GAN). These techniques take advantage of the fact that, although the group memberships of the unlabeled data are unknown, this data contains crucial details about the group parameters.
• Reinforcement Learning Models: these are close to human learning, in that feedback drives the learner toward behavior that works better; Q-learning is an example. This technique tries to take decisions that maximize the reward or minimize the risk, using observations acquired from interaction with the environment. The agent, a reinforcement learning algorithm, learns iteratively from its surroundings, [18].

3.2 Execution Process

In our proposed model, the execution process is performed in three phases, as shown in Fig. 3:

Phase 1:
Prepare the dataset by using a statistical approach to create features describing the tweets, comments, and posts, and balance the class distribution to guarantee realistic behavior and acceptable results.

Phase 2:
Use nine relevant machine learning and deep learning tools for training and testing on the prepared dataset, with the aim of classifying tweets, comments, and posts.

Phase 3:
Compare and discuss the results of the techniques used in Phase 2. The comparison is based on precision, recall, and F1-score. If the results are not satisfactory, return to Phase 2.

Fig. 3: Phases of the proposed method
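The three metrics used in Phase 3 can be computed directly from the true and predicted labels. The following is a minimal sketch for a single positive class. The function name `precision_recall_f1` and the toy labels are illustrative assumptions, and the paper does not state whether its reported scores are per-class or averaged over classes.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical, not taken from the dataset).
y_true = ["HS", "HS", "NOT_HS", "NOT_HS", "HS"]
y_pred = ["HS", "NOT_HS", "NOT_HS", "HS", "HS"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="HS")
print(round(p, 2), round(r, 2), round(f, 2))
```

Because F1 is the harmonic mean of precision and recall, it equals them when the two coincide, and it is pulled toward the lower value when they differ.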

3.3 Features Preparation

In order to detect hate and offensive speech in

Arabic slang tweets, posts, and comments, we

propose Machine Learning methods using statistical features, [19], [20]. This choice is motivated by several reasons: it is very difficult to set linguistic or semantic rules for modeling Arabic slang, since it does not follow any grammatical rules; furthermore, slang is very local, varying even within the same country. For these reasons, we chose a purely statistical approach.

As mentioned, our approach uses a set of numerical features that are fed into the Machine Learning models. Table 1 presents the statistical features used.

Table 1. Used features

| Feature | Explanation |
|---|---|
| words_number | Number of words in the tweet |
| Char_number | Number of characters in the tweet |
| CRLF | CR and LF are control characters that are used to mark a line break in the tweet |
| retweet_number | The number of retweets |
| Emoticons | An emoticon is a representation of a human facial expression using only keyboard characters such as letters, numbers, and punctuation marks. |
| Emojis | An emoji is an image small enough to insert into text that expresses an emotion or idea |
| question_mark | |
| interrogation_mark | |
| dot_mark | Full stop. |
| Hashtag | The hashtag is used to highlight keywords or topics within a Tweet |
| URL | Uniform Resource Locator |
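As an illustration, the features of Table 1 could be extracted from a raw tweet roughly as follows. This is a minimal sketch, not the authors' implementation: the regular expressions, the emoji code-point ranges, and the function name `extract_features` are assumptions. `retweet_number` is treated as metadata supplied by the caller, `question_mark` is assumed to count both "?" and the Arabic "؟", and the `interrogation_mark` feature is omitted because its definition is not given.

```python
import re

# Rough emoji code-point ranges (assumption; not an exhaustive emoji list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Simple Western-style emoticons such as :) ;-( :D (assumption).
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\(\]\[dDpP/\\|]")
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_features(text, retweet_number=0):
    """Map one tweet to a dictionary of numeric, language-independent features."""
    # Strip URLs before counting punctuation so that '://' or '.' inside a
    # link is not mistaken for an emoticon or a full stop.
    no_urls = URL_RE.sub(" ", text)
    return {
        "words_number": len(text.split()),
        "char_number": len(text),
        "crlf": text.count("\n"),  # '\r\n' also contains exactly one '\n'
        "retweet_number": retweet_number,
        "emoticons": len(EMOTICON_RE.findall(no_urls)),
        "emojis": len(EMOJI_RE.findall(text)),
        "question_mark": no_urls.count("?") + no_urls.count("؟"),
        "dot_mark": no_urls.count("."),
        "hashtag": len(HASHTAG_RE.findall(text)),
        "url": len(URL_RE.findall(text)),
    }


sample = "hello world :) 😍 #tag http://a.b/c ?\nbye."
print(extract_features(sample, retweet_number=3))
```

None of these counts depends on the vocabulary or grammar of the text, which is what makes the feature set usable on slang.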

4 Experimental Study

This section provides an experimental study for evaluating our proposed model. The extracted statistical features are fed into a set of nine well-known Machine Learning models:

• Logistic Regression: despite its name, it is a classification model rather than a regression model, mainly used for binary and linear classification problems, [21].
• Decision Tree: it provides a classification and predictive model that can easily be presented graphically. This model can handle both numerical and categorical data, [22].
• k-Nearest Neighbors (KNN): this learning model stores all available data points (examples) and classifies new data points based on similarity measures, [23].
• Linear Discriminant Analysis (LDA): it is a very common technique for dimensionality reduction, used as a pre-processing step for machine learning and pattern classification applications, [24].
• Multinomial Naive Bayes: it predicts the tag of an observation, such as a word, a frequency, or a PoS, using the Bayes model. It calculates each tag’s likelihood for a given observation and outputs the tag with the highest likelihood, [25].
• Gaussian Naive Bayes: Naive Bayes is a group of supervised machine learning classification algorithms based on Bayes’ theorem. It is a simple technique for constructing classifiers: models that assign class labels to problem instances, [26].
• Support Vector Machine (SVM): it is a supervised machine learning model that can be used for classification and regression problems. However, the support vector machine is mathematically complex and computationally expensive, [27].
• Random Forest: it is a computationally efficient technique that can operate quickly over large datasets, [28].
• Neural Network: it is inspired by the sophisticated functionality of the human brain, where billions of interconnected neurons process information in parallel, [29].
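To make one of the listed models concrete, the k-Nearest Neighbors idea (store the examples, then label a new point by its most similar neighbors) fits in a few lines. This is a minimal sketch under stated assumptions: Euclidean distance as the similarity measure, `k=3`, the function name `knn_predict`, and the toy two-dimensional feature vectors are all illustrative, since the hyperparameters used in the experiments are not reported.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    # Rank every stored example by Euclidean distance to the query point.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical values, not taken from the dataset).
train_X = [[1, 0], [2, 1], [1, 1], [8, 9], [9, 8], [9, 9]]
train_y = ["NOT_OFF", "NOT_OFF", "NOT_OFF", "OFF", "OFF", "OFF"]
print(knn_predict(train_X, train_y, [1.5, 0.5]))
print(knn_predict(train_X, train_y, [8.5, 8.5]))
```

In practice the statistical features have very different scales (character counts versus punctuation counts), so distance-based models such as this one usually need the features to be normalized first.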

4.1 Dataset Description

In this work, we use an Arabic slang dataset named OSACT2020-shared Task. This dataset contains 6964 tweets, comments, and posts that are manually annotated for two tasks: offensiveness (labels: OFF or NOT_OFF) and hate speech (labels: HS or NOT_HS). The class distribution of this dataset is given in Table 2.
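The class ratios of this dataset follow directly from the class counts. The short check below uses the counts reported in Table 2; the HS count of 348 is the total of 6964 minus the 6616 NOT_HS records. The names `counts` and `class_ratio` are illustrative.

```python
TOTAL = 6964
counts = {"OFF": 1323, "NOT_OFF": 5641, "HS": 348, "NOT_HS": 6616}


def class_ratio(label):
    """Share of the dataset carrying the given label, as a rounded percentage."""
    return round(100 * counts[label] / TOTAL)


for label in counts:
    print(label, counts[label], f"{class_ratio(label)}%")
```

Both label pairs sum to the full dataset, and the minority classes (OFF and HS) are clearly under-represented.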

Table 2. Distribution of different classes in the OSACT2020-dataset

| Classes | Number of tweets | Ratio | Example |
|---|---|---|---|
| OFF | 1323 | 19% | [Arabic text] !! |
| NOT_OFF | 5641 | 81% | [Arabic text] |
| HS | 348 | 5% | [Arabic text] |
| NOT_HS | 6616 | 95% | [Arabic text] 😍 🌹 |

As shown in Table 2, the distribution of classes in the dataset is imbalanced. For instance, the HS class accounts for only 5% of the dataset. We handled this issue using the SMOTE algorithm, [30], an oversampling technique that generates synthetic samples for the minority class. For this purpose, we separated the two tasks (HS and OFF) into two different feature files. Then, we applied the SMOTE algorithm to each file, which provides the following balanced distribution (Table 3):

Table 3. New classes distribution after applying SMOTE algorithm

| Classes | Number of records | Ratio | Total number |
|---|---|---|---|
| OFF | 1974 | 50% | 3948 |
| NOT_OFF | 1974 | 50% | 3948 |
| HS | 2360 | 50% | 4720 |
| NOT_HS | 2360 | 50% | 4720 |
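The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified pure-Python illustration, not the implementation or settings used in the experiments, which are not specified here: the function name `smote`, the neighbor count `k`, the fixed random seed, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors.

    `minority` is a list of numeric feature vectors with at least two entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the base point (itself excluded).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority-class feature vectors (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Each synthetic record lies on a line segment between two real minority records, so the new points stay inside the region already occupied by the minority class instead of duplicating existing records.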

4.2 Results

The classification results of the Machine Learning models for the HS/NOT_HS class are shown in Fig. 4 and Table 4. Random Forest and Decision Tree outperform all the other models, with F1-scores of 0.9 and 0.88, respectively. On the other hand, Gaussian Naive Bayes and SVM provide the worst results, with F1-scores of 0.61 and 0.69, respectively.

In conclusion, a precision and a recall of 0.9 are very good results given the considerable linguistic challenges of extracting hate speech from Arabic slang documents using only statistical features.

Table 4. Comparison between ML models for HS/NOT_HS class. The best results are shown in bold

| ML Models | Precision | Recall | F1-score |
|---|---|---|---|
| Logistic Regression | 0.79 | 0.79 | 0.79 |
| Decision Tree | 0.88 | 0.88 | 0.88 |
| KNN | 0.84 | | 0.83 |
| Linear Discriminant Analysis | 0.78 | 0.78 | 0.78 |
| Multinomial Naive Bayes | 0.76 | | 0.74 |
| Gaussian Naive Bayes | 0.71 | 0.64 | 0.61 |
| SVM | 0.73 | 0.7 | 0.69 |
| Random Forest | 0.9 | 0.9 | 0.9 |
| Neural Network | 0.85 | 0.85 | 0.85 |

Regarding the OFF/NOT_OFF class, the classification results follow the same pattern as for the first class. As shown in Fig. 5 and Table 5, Random Forest and Decision Tree are still the best models and outperform all the others on the three evaluation metrics, with F1-scores of 0.75 and 0.72, respectively. Gaussian Naive Bayes and SVM again provide the worst results for this class, with F1-scores of 0.53 and 0.61, respectively.

Fig. 4: Evaluation of ML models on HS/NOT_HS

class

The main interpretation that can be drawn from these results is that the performance of the proposed model decreases when handling offensive speech compared with hate speech. This can be explained by the fact that offensive speech is more complex and harder to detect than hate speech and needs more sophisticated techniques than simple statistical features. Although statistical features are simple to model and extract from textual content and make the approach language-independent, they cannot capture semantic relations and need to be combined with semantic knowledge, such as ontologies, [31], [32], [33].

Table 5. Comparison between ML models for OFF/NOT_OFF class. The best results are shown in bold

| ML Models | Precision | Recall | F1-score |
|---|---|---|---|
| Logistic Regression | 0.71 | 0.71 | 0.71 |
| Decision Tree | 0.72 | 0.72 | 0.72 |
| KNN | 0.71 | | 0.7 |
| Linear Discriminant Analysis | 0.7 | 0.7 | 0.7 |
| Multinomial Naive Bayes | 0.7 | | 0.69 |
| Gaussian Naive Bayes | 0.59 | 0.56 | 0.53 |
| SVM | 0.61 | 0.61 | 0.61 |
| Random Forest | 0.75 | 0.75 | 0.75 |
| Neural Network | 0.72 | | 0.71 |

Fig. 5: Evaluation of ML models on OFF/NOT_OFF class

5 Conclusion

The use of social media has become an everyday habit. While it is a source of news, exchange of ideas, and communication, it is also a source of serious problems such as hate messages and cyberbullying. Meanwhile, Arabic ranks fourth among the most commonly used languages on the internet. Nevertheless, few research works address this problem for the Arabic language, and even fewer deal with it in Arabic slang.

This motivated us to propose, in this paper, a purely statistical approach for detecting and predicting cyberbullying, hate speech, and offensive tweets, comments, and posts. Our proposed method provided good results on a prepared Arabic slang dataset. The results show that our method performed best when using Random Forest and Decision Tree as classification models.

In future work, we plan to improve the detection results by working more on the dataset. This can be done by integrating Natural Language Processing (NLP) rules with the statistical approach for data preparation.

Acknowledgement:

The researchers would like to thank the Deanship of

Scientific Research, Qassim University for funding

the publication of this project.

References:

[1] Statista, “Most common languages used on the internet as of January 2020, by share of internet users,” 2020. [Online]. Available: https://www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet/

[2] Q. Huang, V. K. Singh, and P. K. Atrey,

“Cyber bullying detection using social and

textual analysis,” in Proceedings of the 3rd

International Workshop on Socially-aware

Multimedia, Orlando, Florida, USA, pp. 3–6,

2014.

[3] B. S. Nandhini and J. Sheeba, “Online social

network bullying detection using intelligence

techniques,” Procedia Computer Science, vol.

45, pp. 485–492, 2015.

[4] V. Nahar, S. Al-Maskari, X. Li, and C. Pang,

“Semi-supervised learning for cyberbullying

detection in social networks,” in Australasian

Database Conference, Brisbane, QLD,

Australia, pp. 160–171, Springer, 2014.

[5] P.-J. Lee, Y.-H. Hu, K. Chen, J. M. Tarn, and

L.-E. Cheng, “Cyberbullying detection on

social network services,” in PACIS 2018

Proceedings, Yokohama, Japan, vol. 61,

2018.

[6] M. Alotaibi, B. Alotaibi, and A. Razaque, “A

multichannel deep learning framework for

cyberbullying detection on social media,”

Electronics, vol. 10, no. 21, pp. 1–14, 2021.

[7] A. Akhter, U. K. Acharjee, and M. M. A.

Polash, “Cyber bullying detection and

classification using multinomial naïve bayes

and fuzzy logic,” Int. J. Math. Sci. Comput,

vol. 5, pp. 1–12, 2019.

[8] A. Ioannou, J. Blackburn, G. Stringhini, E. De

Cristofaro, N. Kourtellis, and M. Sirivianos,

“From risk factors to detection and

intervention: a practical proposal for future

work on cyberbullying,” Behaviour &

Information Technology, vol. 37, no. 3, pp.

258–266, 2018.

[9] B. Haidar, M. Chamoun, and A. Serhrouchni,

“A multilingual system for cyberbullying

detection: Arabic content detection using

machine learning,” Advances in Science,

Technology and Engineering Systems Journal,

vol. 2, no. 6, pp. 275–284, 2017.

[10] B. Haidar, M. Chamoun, and A. Serhrouchni,

“Multilingual cyberbullying detection system:

Detecting cyberbullying in arabic content,” in

2017 1st Cyber Security in Networking

Conference (CSNet), Rio de Janeiro, Brazil,

pp. 1–8, IEEE, 2017.

[11] H. Mohaouchane, A. Mourhir, and N. S.

Nikolov, “Detecting offensive language on

arabic social media using deep learning,” in

2019 Sixth International Conference on Social

Networks Analysis, management and security

(SNAMS), Granada, Spain, pp. 466–471,

IEEE, 2019.

[12] A. Omar, T. M. Mahmoud, and T. Abd-El-

Hafeez, “Comparative performance of

machine learning and deep learning

algorithms for arabic hate speech detection in

osns,” in The International Conference on

Artificial Intelligence and Computer Vision,

Cairo, Egypt, pp. 247–257, Springer, 2020.

[13] F. Husain and O. Uzuner, “A survey of

offensive language detection for the arabic

language,” ACM Transactions on Asian and

Low-Resource Language Information

Processing (TALLIP), vol. 20, no. 1, pp. 1–44,

2021.

[14] R. ALBayari, S. Abdullah, and S. A. Salloum,

“Cyberbullying classification methods for

arabic: A systematic review,” in The

International Conference on Artificial

Intelligence and Computer Vision, Settat,

Morocco, pp. 375–385, Springer, 2021.

[15] S. Zidi, T. Moulahi, and B. Alaya, “Fault

detection in wireless sensor networks through

svm classifier,” IEEE Sensors Journal, vol.

18, no. 1, pp. 340–347, 2017.

[16] T. Moulahi, “Joining formal concept analysis

to feature extraction for data pruning in cloud

of things,” The Computer Journal, pp. 1–9,

2021.

[17] T. Moulahi, S. El Khediri, R. U. Khan, and S.

Zidi, “A fog computing data reduce level to

enhance the cloud of things performance,”

International Journal of Communication

Systems, vol. 34, no. 9, pp. 1–13, 2021.

[18] A. Mchergui and T. Moulahi, “A novel deep

reinforcement learning based relay selection

for broadcasting in vehicular ad hoc

networks,” IEEE Access, vol. 10, pp. 112–

121, 2021.

[19] F. Fkih and M. N. Omri, “Information

retrieval from unstructured web text document

based on automatic learning of the threshold,”

International Journal of Information Retrieval

Research (IJIRR), vol. 2, no. 4, pp. 12–30,

2012.

[20] F. Fkih and M. N. Omri, “Hidden data states-

based complex terminology extraction from

textual web data model,” Applied Intelligence,

vol. 50, no. 6, pp. 1813–1831, 2020.

[21] A. Subasi, Practical Machine Learning for Data Analysis Using Python. Academic Press, 2020. [Online]. Available: https://www.sciencedirect.com/book/9780128213797/practical-machine-learning-for-data-analysis-using-python

[22] V. Matzavela and E. Alepis, “Decision tree

learning through a predictive model for

student academic performance in intelligent

m-learning environments,” Computers and

Education: Artificial Intelligence, vol. 2, p.

100035, 2021.

[23] I. Saini, D. Singh, and A. Khosla, “Qrs

detection using k-nearest neighbor algorithm

(knn) and evaluation on standard ecg

databases,” Journal of Advanced Research,

vol. 4, no. 4, pp. 331–344, 2013.

[24] A. Tharwat, T. Gaber, A. Ibrahim, and A. E.

Hassanien, “Linear discriminant analysis: A

detailed tutorial,” AI Communications, vol.

30, no. 2, pp. 169–190, 2017.

[25] A. M. Kibriya, E. Frank, B. Pfahringer, and G.

Holmes, “Multinomial naive bayes for text

categorization revisited,” in Australasian

Joint Conference on Artificial Intelligence,

Canberra, ACT, Australia, pp. 488-499,

Springer, 2004.

[26] C. Bustamante, L. Garrido, and R. Soto,

“Comparing fuzzy naive bayes and gaussian

naive bayes for decision making in robocup

3d,” in Mexican International Conference on

Artificial Intelligence, Mexico City, Mexico,

pp. 237– 247, Springer, 2006.

[27] S. Suthaharan, “Machine learning models and

algorithms for big data classification,” Integr.

Ser. Inf. Syst, vol. 36, pp. 1–12, 2016.

[28] T. M. Oshiro, P. S. Perez, and J. A.

Baranauskas, “How many trees in a random

forest?”, in International Workshop on

Machine Learning and Data Mining in

Pattern Recognition, Berlin, Germany, pp.

154–168, Springer, 2012.

[29] S.-C. Wang, “Artificial neural network,” in

Interdisciplinary Computing in Java

Programming, pp. 81– 100, Springer, 2003.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and

W. P. Kegelmeyer, “Smote: synthetic

minority oversampling technique,” Journal of

Artificial Intelligence Research, vol. 16, pp.

321–357, 2002.

[31] F. Fkih and M. N. Omri, “Estimation of a

priori decision threshold for collocations

extraction: an empirical study,” International

Journal of Information Technology and Web

Engineering (IJITWE), vol. 8, no. 3, pp. 34–

49, 2013.

[32] F. Fkih and M. N. Omri, “Hybridization of an

index based on concept lattice with a

terminology extraction model for semantic

information retrieval guided by wordnet,” in

International Conference on Hybrid

Intelligent Systems, Marrakech, Morocco, pp.

144–152, Springer, 2016.

[33] F. Fkih, M. N. Omri, and I. Toumia, “A

linguistic model for terminology extraction

based conditional random field,” in:

Proceedings of the International Conference

on Computer Related Knowledge, ICCRK

2012, Sousse, Tunisia, pp. 38, 2012.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

Fethi Fkih wrote the original draft of the paper and

carried out the simulation, the formal analysis, and

the optimization.

Tarek Moulahi has defined the methodology, and

reviewed, and edited the paper.

Abdulatif AlAbdulatif was responsible for the

project administration and the funding acquisition.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

This work was funded by the Deanship of Scientific

Research, Qassim University, Saudi Arabia.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en_US

Conflict of Interest

The authors have no conflicts of interest to declare

that are relevant to the content of this article.