AraQA-BERT: Towards an Arabic Question Answering System using

Pre-trained BERT Models

AFNAN H. ALSHEHRI

Faculty of Informatics and Computer Systems Department

College of Computer Science,

King Khalid University,

Abha,

SAUDI ARABIA

Abstract: - To increase performance, this study presents AraQA-BERT, an Arabic question-answering (QA)

system that makes use of pre-trained BERT models. The study emphasizes how important QA systems are for

promptly and accurately responding to user inquiries, especially when those inquiries are made in native

tongues. Arabic QA systems are necessary because of the complexity and linguistic variances of the Arabic

language, even if English QA systems have made substantial progress. The study examines the use of pre-

trained language models, such as AraBERT and Arabic-BERT, for Arabic QA tasks with a focus on Modern

Standard Arabic (MSA). The study's contributions include the creation of a web-based application named

AraQA-BERT for open-domain QA, trials on TyDi and ARCD datasets, and a methodology for employing pre-

trained models.

Key-Words: - AraQA-BERT, Modern Standard Arabic, Question-answering, QA systems, Linguistic

variations, Arabic-BERT, Arabic-BERT.

Received: August 9, 2023. Revised: May 17, 2024. Accepted: July 6, 2024. Published: August 8, 2024.

1 Introduction

Information retrieval (IR) approaches are used with

comprehension algorithms to obtain pertinent

sections or publications that address user inquiries.

The three steps of an IR-based QA system are

explained: question processing, passage retrieval,

and answer processing.

Ontologies like DBPedia or Freebase are used to

represent structured databases. The process involves

creating a semantic representation of the question

and using parsers or query languages to extract

answers. Rule-based and supervised methods are

discussed as approaches for implementing

knowledge-based QA systems.

Pre-trained language models (PTMs) are

introduced to enhance natural language processing

(NLP) tasks by leveraging knowledge from a pre-

trained model. The advantages of PTMs, such as

saving time and cost, are highlighted. BERT

(Bidirectional Encoder Representations from

Transformers) is mentioned as a significant pre-

trained language model for QA systems. The pre-

training and fine-tuning processes of BERT are

explained.

The paper addresses several challenges in

developing an Arabic QA system, including the

ambiguity of the Arabic language, the absence of

capital letters, and variations in dialects. However,

advancements in pre-trained models like AraBERT

and Arabic-BERT have shown promising results in

NLP downstream tasks, including QA systems.

Additionally, the availability of Typologically

Diverse Question Answering (TyDi QA) and Arabic

Reading Comprehension Dataset (ARCD) has

facilitated high-quality research on QA systems for

different languages.

The paper aims to build an effective and precise

Arabic QA model by fine-tuning the AraBERT and

Arabic-BERT models using the Fast-BERT library.

The performance is assessed using F1 and Exact

Match measures, and the TyDi and ARCD datasets

are employed for this purpose. The design and

implementation of the AraQA-BERT system, a

web-based tool that extracts precise answers from

the model's output, are also presented in the study.

Usability evaluation is performed to explore the

extent to which the suggested technique meets the

user experience.

2 Related Works

In [1], proposed WebShodh to do the job of taking

questions in telegraphic languages varying from

Hindi to Telugu transforming them into English and

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

361

Volume 21, 2024

extracting the answers. WebShodh is describing the

whole system for factoid questions. The system

works on a Google Search API for relevant result

searches. The MRR of the Hindi evaluation was

found to be 0.37 while for Telugu it was 0.32.

As in [2] are the magicians, they presented the

"Ask Your Neuron" protocol where visual and

textual questions are the inputs to generate the

answers. The system's performance was appraised

using the scientifically approved DAUQAR and

VQA datasets. This proposed approach was found to

outperform the existing methods in terms of

numeracy and recall which tend to give higher

accuracy and recall by the textual questions.

In [3], constructed a neural network encoder-

decoder, NQG++, which not only takes natural

language as its input and accomplishes the task of

harnessing its questions but also performs a

decoding function on their answers. The

performance of the proposed approach had been

evaluated on a dataset called the Stanford Question

Answer Dataset (SQuAD). Our proposed model

NQG++ has got better scores, achieving a higher

mean score of Neural Question Generation than the

old PCFG-Trans model in the baseline.

Also, researchers have been looking into

designing more automatic QA systems that could be

used to navigate through the web in recent years. In

[4] have got a system for basketeer for QA systems.

The method was taking user questions in natural

language, extracting various attributes to build the

answer, and comparing the questions with items

already present in the knowledge network. Named

Entity Recognition (NER) self-training achieved an

accuracy of 95.96%, and the system's overall test

dataset accuracy was 93%.

English QA systems are using recent advances

in pre-trained language models, including BERT

[5]. To normalize answer scores and enhance

performance on benchmarks such as Quasar-T,

SearchQA, TriviaQA, and OpenSQuAD, in [6]

employed BERT in numerous passages. Researchers

in [7] similarly improved the BERT model by

employing a community-based QA approach, and

they were able to obtain a Mean Square Average of

0.046.

Research studies on English QA systems have

incorporated several important datasets. In [8],

authors assessed their system with the DAUQAR

and VQA datasets, which have a lot of images and

questions associated with them. In [9], made use of

the Wikidata and DBpedia datasets. The DBpedia

dataset is taken from a question-and-answer dataset

based on the encyclopedic source Wikipedia and

Wikidata is an open-source structured data where a

considerable number of resources is included,

Stanford Question and Answer Dataset (SQuAD),

which naturally has a large dataset with human-

made questions and answers in articles written on

Wikipedia.

Adequate processing of questions is one of the

principal components of such systems, which is to

identify and classify user questions for

corresponding answers. In this case, the "question

understanding process" aims at distinguishing types

of questions and meanings to eradicate ambiguity,

and properly answer the question. ML techniques

can function both in a way that answers questions

are moved into processing by manners or through

automated ways. Arabic QA systems question

processing has been the center of several review

studies although.

As stimulated in [10], there was a development

of an Arabic-based Language Analyzer utilizing the

NooJ system. Particularly, this project directly

involved the difficult "why" and "what-is" questions

in the Arabic process. The proposed approach

presents important results, as shown by a recall

value of 68% and mean reciprocal rank (MRR) of

0.62 which yields higher accuracy than the results of

the control group.

Relying on the theories of [11], the researchers

concentrated on answering the “why” and “how”

questions in Arabic. The RST-based algorithm

allowed the machine to produce texts according to

the rules and the RST approach was applied. The

performance of LEMAZA is as anticipated,

sweeping the floor with already existing methods

(Precision, 79.2%; Recall, 72.7%).

In [12], presented temporal problem-solving for

Arabic QA systems that they strive to provide exact

answers to questions. Extraction of verbs and named

entities from the text and a corpus consisting of

temporal sentences and part-of-speech

representations were produced. The NLP

mechanism fed the system with temporal pattern

features that enabled it to compare the questions and

answers to conclude. The constructed model seemed

well suited to locate suitable solutions to time

questions.

To build an Arabic Question-and-answering

system, in [13] have applied planting lexical

knowledge and knowledge rules in this procedure.

They categorized time and position nouns and

numbered entities by way of the morphological

analysis. The system relied on NooJ's grammatical

tapes and data-detection technology for answer

recognition and conversion to class. The combined

technique proved to be effective as the experimental

results showed good precision and F-measure.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

362

Volume 21, 2024

Study from [14] concentrated on Arabic QA

systems and messages' question processing and

answer pattern development. They produced

responses by the time, focus, and topic categories

they used to classify the questions. The precision,

recall, and F-measure of the experiment were all

good.

The study [15] developed an Arabic quality

assurance system that contained answers and used

natural language processing to extract passages. The

system that was used for the identification of the

queries and the generation of the replies was based

on the dependency parsing and the named entity

recognition. There was a good performance of the

experiments with reasonable recall and precision.

In [16], the author introduced an Arabic quality

assurance system based on pattern matching and

semantic analysis in another work. The algorithm

identified relevant segments from the questions

through the use of a semantic analyzer. We

employed pattern matching algorithms on the

passages to search for the responses and recognize

them. The results of the conducted experiment

showed how effective the proposed approach is in

Arabic QA systems.

For Arabic QA systems, researchers proposed a

transfer learning-based method [17]. In the

procedure of improving the efficiency of QA

systems, large-scale datasets with pre-trained

models were applied. The results of the trial

demonstrated how transfer learning is done in

Arabic QA systems; subsequently, the performance

and accuracy improvements were realized.

In [18] explored the authors questioning as to

how the BERT model could be enhanced for Arabic

QA systems. They employed Arabic datasets from

Quora and Stack overflow for enhancing the BERT

model. The results of the trial were MSA of 0. 046,

demonstrated enhanced performance.

Each of the research studies presented in this

section contributed to the development of the

guidelines or conventions for design and processing

of questions. Based on four criteria, Table 1

compiles studies pertinent to a rules-based

approach: The “Target source” denotes the name

and the type of the analyzed data; the “Question

analysis techniques correspond to the techniques

used by the system under consideration as well as

the “Question classification techniques correspond

to the classification techniques adopted; and finally,

the “Performance” section provides some of the

experimental results obtained.

Authors in [20] presented an Arabic QA system

with ontological development. They constructed an

ontology for the pathology domain, establishing

relationships between diseases, organs, reasons, and

symptoms.

Table 1. Certain Arabic QAS rule-based features

and techniques

System

Target

source

Question

analysis

techniques

Question

classificatio

techniques

System

Performa

nce

QARAB

[18]

Al-Raya

Arabic

newspap

er text

NLP

technique

and IR

Using a set

of known

question

type

Precision:

97.3%.

Recall:

97.3%

QASAL

[19]

Arabic

Docume

nts

NLP

technique

Defined

question

types and

forms using

NooJ local

grammar

Not

mentioned

A Discourse-

Based

Approach for

Arabic

Question

Answering

[10]

Arabic

corpus

belongin

g to the

Health,

Science

Technol

ogy

categorie

NLP

technique

Defined

"why" and

"how"

questions

recall of

68% and

MRR of

0.62

LEMAZA

[11]

Open-

Source

Arabic

Corpora

(OSAC)

Rhetorical

Structure

Theory

(RST)

based

algorithm

Defined

"why" type

of questions

Precision

is 79.2%,

and recall

is 72.7%

A Rules-

based

Approach for

Arabic

Temporal

Expression

Extraction

[13]

Pilot

Arabic

Propban

k data

Morphologi

cal analysis

using NooJ

local

grammar

An F-

measure

score of

95.5%

Generating

Answering

Patterns

from Factoid

Arabic

Questions

[14]

Not

mentione

Morphologi

cal

Analysis of

Factoid

Arabic

Questions

using NooJ

local

grammar

Using a set

of known

question

type

75%

precision,

72%

recall, and

an F-

measure

of 73%

The dataset was manually generated using the

Protégé tool for precise and relevant answer

formation.

Basic NLP techniques such as tokenization,

stemming, stop word removal, and lemmatization

were applied to process incoming questions. The

proposed methodology achieved a high precision of

81%, recall of 93%, and an F-measure of 86%,

outperforming existing approaches.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

363

Volume 21, 2024

In [21], developed a Community Question-

Answer (CQA) model for Arabic using the

SemEval-2017 Task 3 dataset. They coupled term-

based and word2vec similarity measures with the

QU-BIGR system similarity measure, and they used

the Support Vector Machine (SVM) for data

training. In a competition, the suggested method

placed second out of all the systems in use,

illustrating its usefulness for answering optimization

in CQA systems.

In [22], researchers addressed the CQA issue

and concentrated on question classification in

Arabic QA systems. They used neural networks and

supervised machine learning algorithms (SVM,

logistic regression, and random forest) to categorize

questions. SPLIT and MADAMIRA were utilized to

encode the input data from the SEMEval2017 CQA

dataset. Lexical and semantic features were the two

feature types that were employed. The SVM method

achieved a Mean Average Precision (MAP) of

62.85%, outperforming other algorithms. In the

experiments, mean reciprocal rank (MRR) and

average accuracy (AvgRec) were taken into

consideration as evaluation criteria.

In [23], proposal included the Unstructured

Information Management Architecture (UIMA) and

a community-based Arabic QA system developed

using neural network models. The Farasa NLP tool

served as the foundation for the UIMA Arabic NLP

architecture, and the system's performance was

assessed using Mean Average Precision (MAP),

Average Precision (AvgP), and P(k). The outcomes

demonstrated encouraging MAP values without

testing-related trimming.

In [24], presented an approach for question

classification using SVM and Convolutional Neural

Networks (CNN). The coarse class generated by

SVM was used as input for neural networks to

predict the subclass. The approach achieved an

accuracy of 82% and included six rough classes:

abbreviation, entity, description, human, location,

and numeric value. The fine-grained classification

was challenging due to the similarity among fine

classes and the difficulty training instances within

each fine class.

In [25], introduced a new Arabic taxonomy for

QA systems using an ML approach. They collected

data from various datasets and trained an SVM

classifier. The proposed methodology achieved an

accuracy of 90%, indicating its enhancement of

existing QA systems.

In [26], presented the Visual QA System, which

utilized bidirectional recurrent neural networks

(BiSRU and BiLSTM) for feature extraction. The

bidirectional approach improved answer prediction

by considering historical and future contextual

information. The Stacked Attention Model (SAM)

and EnSAN were adopted for query attention

modeling. Experimental results on the COCO-QA

dataset showed an optimal accuracy of 2.2%.

These studies demonstrate the effectiveness of

ML approaches, with the SVM classifier commonly

used for accurate question classification. Neural

networks have also gained popularity and have

shown significant results in question categorization

using ML techniques in Table 2.

Table 2. Some Arabic QAS machine learning

features and techniques

System

Target

source

Question

analysis

techniqu

Question

classificati

techniques

System

Performan

Arabic

Question

Answering

Using

Ontology

[21]

Their own

Corpora,

comprising

100

Questions

NLP

technique

Using a set

of known

questions

Accuracy

of 81%,

recall of

93%, and

an F-

Measure

score of

86%

QU-BIGR

[22]

Arabic

collection of

questions

and their

potentially

question-

answer pairs

NLP

technique

SVM

classifier

MAP score

of 43.41,

F1 score of

62.57

Arabic

Question

Classificati

on Using

Support

Vector

Machines

and

Convolutio

nal Neural

Networks

[25]

TALA-

AFAQ

dataset

NLP

technique

SVM and

CNN

classifier

The

accuracy of

the

proposed

architecture

is 82%

Community

Question

Answering

(CQA) for

the Arabic

language

[23]

SEMEval20

17 CQA

dataset

NLP

technique

SVM,

logistic

regression,

and random

forest

classifier

SVM

outperform

s other

algorithms

as it

generates

62.85%

MAP

Visual

Question

Answer

System

Based on

Bidirectiona

l Recurrent

Networks

[25]

COCO-QA

Dataset

NLP

technique

Bidirection

al recurrent

neural

networks

(BiSRU

and

BiLSTM)

Best results

for a

COCO-QA

accuracy of

2.2%

The lack of large pre-training corpora has been a

challenge for Arabic language processing. In [23],

addressed this issue by training four Arabic-BERT

language models from scratch. These models,

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

364

Volume 21, 2024

named Mini, Medium, Base, and Large, were pre-

trained on approximately 8.2 billion words from the

OSCAR Arabic version, Arabic Wikipedia dumps,

and other Arabic resources. The models were

trained using Google BERT's repository with

modifications to the training settings. These pre-

trained models have been made publicly available

and have contributed to enhancing Arabic NLP

applications.

Another significant development is the

AraBERT model. AraBERT is a transformer-based

model for Arabic that provides a language

representation based on the BERT model. It was

trained using the BERT base configuration and 70

million sentences from large Arabic corpora. The

model was fine-tuned for downstream NLP tasks,

including question answering (QA). AraBERT

achieved state-of-the-art performance in various

Arabic NLP tasks and is publicly available for

research and applications. Furthermore, BERT has

been used for open-domain factoid Arabic QA

systems, [27]. Their approach combined a document

retriever using hierarchical TF-IDF and BERT for

answer extraction, achieving superior results

compared to existing methods.

Additionally in [28], introduced multilingual

BERT, which includes 104 languages. This

development has contributed to the success of NLP

applications and led to the creation of language-

specific BERT models.

These studies demonstrate using sizeable pre-

trained language models, such as BERT, for various

Arabic NLP tasks, including QA. These models

have significantly advanced the field and provided

researchers with powerful tools for Arabic language

processing. In [29], contributed to passage

extraction by formulating queries using POS tagging

and stemmed words, followed by passage extraction

from relevant documents using an Arabic Wikipedia

dataset. Their approach showed promising results in

terms of query-passage similarity.

In [30], presented the dataset for the Arabic

Why Question Answer System (DAWQAS),

consisting of 3205 "why" questions. The dataset was

constructed through several steps: obtaining data,

data preprocessing, identification of the same type

of questions, and probability-based labeling.

DAWQAS was trying to cover the absence of

datasets that explain "why" questions in the Arabic

language since no other datasets existed.

In [31], proposed the QA model TyDi which

consists of 922 pairs of questions and answers in

Arabic. TyDI QA covers 11 languages that are

typologically different and is designed to expose

linguistic characteristics as well as to help

reconstruct the data for various types of speech

using data scenarios and systems. Table 3 is an

example of the accessible datasets of the QA in the

Arabic language.

Table 3. Some Available Arabic QAS Datasets

Dataset

Source

Formulation

Size

Arabic-SQuAD

[29]

Translated

SQuAD

Paragraph,

Question

and

Answer

48,344

ARCD

[30]

Arabic

Wikipedia

Paragraph,

Question

and

Answer

1,395

ArabiQA

[32]

Wikipedia

Question

and

Answer

200

DefArabicQA

[33]

Wikipedia and

Google search

engine

Question

and

Answer with

documents

Translated TREC

and CLEF

[34]

Translated

TREC and

CLEF

Question

and

Answer

2,264

QAM4MRE

[35]

Selected topics

Document

Question and

multiple answers

160

DAWQUAS

[30]

Auto-

generated

from a web

scrape

Question

and

Answer

3205

QArabPro

[36]

Wikipedia

Question

and

Answer

335

3 Methodology

The choice of the data set TyDi QA [31] has been

made for two main reasons: (1) the dataset supports

a number of languages and (2) the data was gathered

using a realistic approach. The collection comprises

11 genetically not related languages, such as Arabic,

with the number of QA pairs being equal to 200k

and a pack of 922 pairs developed for the Arabic

language.

Besides posing questions, people using TyDi ask

them without any knowledge, and as a result, the

dataset applies perfectly to our project. The dataset

is collected without translation, ensuring

authenticity and language-specific characteristics.

The Arabic Reading Comprehension Dataset

(ARCD) [27] is chosen for system evaluation. The

ARCD consists of 1,395 questions written by

proficient Arabic speakers. It offers answers to

diversity and questions reasoning categories of a

comprehensive assessment.

The methodology involves the following steps:

 Data Pre-processing: The input QA datasets

(TyDi and ARCD) undergo cleaning and

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

365

Volume 21, 2024

processing to prepare the data for training and

testing.

 Dataset Splitting: The datasets are divided into

training and testing sets using the random state

in train_test_split.

 Fine-tuning: The pre-trained models are fine-

tuned for the QA system task, enabling the

models to learn specific textual properties

relevant to Arabic QA.

 Answer Selection: The model is trained to select

a span of text containing the answer given a

question and passage. This is achieved by

predicting 'start' and 'end' tokens that mark the

answer span.

 Passage Retrieval: The system retrieves the

passage containing the exact answer based on

the predicted answer span.

 Evaluation: The results of the QA system are

evaluated using accuracy, precision, F-measure,

and Exact Match metrics. A benchmark

approach, such as the TF-IDF reader, is used for

comparison.

The high-level architecture of the model is

illustrated in Figure 1.

 The fine-tuning process involves training the

model to predict the 'start' and 'end' tokens in the

passage, followed by passage retrieval and

answer classification.

 The evaluation metrics are used to assess the

performance of the proposed approach

compared to the benchmark method.

Software Tools

Python, a high-level, general-purpose, and

interpreted programming language, is used for

experiments and evaluation. Python provides

various tools and a broad standard library suitable

for multiple tasks. Project Jupyter is used as the

execution environment for running the Python code.

It supports interactive computing in various

programming languages and is widely adopted by

cloud providers.

Collaboratory, or Colab, is a free Jupyter

notebook environment running in the cloud and

storing notebooks on Google Drive.

 The proposed approach is evaluated using four

metrics: precision, recall, F-measure, and Exact

Match.

 Precision measures the accuracy of the system

in identifying correct answers.

 Recall determines the coverage of the system in

retrieving relevant answers.

 The F-measure combines precision and recall

into a single metric.

 Exact Match measures the percentage of

predictions that exactly match the ground truth

answer.

The experiments evaluate the pre-trained

AraBERT and Arabic-BERT models using different

train-test splits and epochs of the training dataset. A

baseline comparison is performed using a TF-IDF

reader based on 4 grams, a popular method for

feature extraction.

Two sets of experiments are conducted:

1. Different train-test splits (80/20 and 50/50)

and 15 epochs of training.

2. Other train-test splits (80/20 and 50/50) and

five epochs of training.

Each dataset (TyDi and ARCD) is tested

separately, and then the datasets are merged for

further evaluation. The Fast-BERT model,

supporting QA tasks, is used in all experiments. The

results of each model and the TF-IDF reader

baseline are reported and compared in Table 4.

Fig. 1: Question Answering Environment Model

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

366

Volume 21, 2024

Table 4. Comparison of the different reader modules results for the TyDi and ARCD datasets, alongside the

merged dataset's training on 15 epochs. For all evaluation processes, time was measured using the minutes unit

Table 5. Comparison of the different reader modules results for the TyDi and ARCD datasets, alongside the

merged dataset's training on five epochs. For all evaluation processes, time was measured using the minutes

unit

Table 6. Comparison of the different reader modules results for the TyDi and ARCD datasets, alongside the

merged dataset's training on 15 epochs. For all evaluation processes, time was measured using the minutes unit

Table 7. Comparison of the different reader modules results for the TyDi and ARCD datasets, alongside the

merged dataset's training on five epochs. For all evaluation processes, time was measured using the minutes

unit

Table 5 reports all of the model results for the

previously mentioned datasets in experiment-1, with

different training of the models using the training set

for five epochs with the same learning rate.

We split the datasets into 50-50% train/test. The

TyDi training set contained 460 pairs, and the

testing set included 461 questions and answers.

Subsequently, we used the ARCD, split into a

training set containing 693 teams and a testing set

containing 702 pairs of questions and answers.

Lastly, we merged both previous datasets, ARCD +

TyDi, splitting them into a training set containing

1152 pairs and a testing set containing 1163

questions and answers. We trained the models using

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

367

Volume 21, 2024

a training set for 15 epochs, with a learning rate of

2e-5. Table 6 reports all of the model results along

with the TF-IDF reader baseline, which evaluates

the testing sets for each previously mentioned

dataset.

Table 7 reports all of the model results for the

previously mentioned datasets, with different

training of the models using the training set for five

epochs with the same learning rate.

The results of the experiments demonstrate

significant improvements over the baseline TF-IDF

Reader. The Arabic-BERT Large model consistently

achieved the highest accuracy in all cases, while the

Arabic-BERT Mini model showed lower accuracy

but faster evaluation times.

Regarding specific datasets, the TyDi dataset

achieved the most accurate results, with an F1 score

of 71.6 and an Exact Match (EM) score of 57.8.

However, the evaluation process for this dataset was

time-consuming. On the other hand, the ARCD

dataset had lower accuracy, with an F1 score of 23.5

and an EM score of 4.7.

It was observed that AraBERTv0.1, which does

not require pre-segmentation, achieved better

accuracy with an F1 score of 63.7 and an EM score

of 50.8 compared to AraBERTv1 in all cases while

also having shorter evaluation times.

The improved performance of the pre-trained

models can be attributed to several factors. The data

size of Arabic-BERT Large played a crucial role, as

training with a larger dataset led to more accurate

results. Conversely, the smaller data size of Arabic-

BERT Mini resulted in lower accuracy. The larger

data size also provided more diversity in the pre-

training distribution, contributing to better

performance. Additionally, pre-segmentation before

BERT tokenization, as in AraBERTv1, reduced the

effectiveness of the QA task, while omitting pre-

segmentation, as in AraBERTv0.1, yielded more

accurate results.

4 Results

4.1 AraQA-BERT System

Given the lack of Arabic open-domain QA systems,

we implemented the AraQA-BERT system,

considered an available domain QA system for the

Arabic language. It is comprised of three module

components: 1) a document retriever (TF_IDF) for

obtaining relevant documents from Arabic

Wikipedia sources, 2) a neural reading

comprehension pre-trained AraBERT module for

extracting answers from retrieved documents,

alongside 3) an answer ranking module for ranking

the answers in order of relevance, based on taking in

scores from the document retriever and AraBERT

reader. Figure 2 presents our proposed system

architecture.

Fig. 2: Overview of the Architecture for our

Question Answering system

4.1.1 TF-IDF Document Retriever

A practical document retrieval method following

classical QA systems was adopted to initially

narrow our search space, concentrating on reading

only articles with a likelihood of being relevant. The

module aims to select the most pertinent documents

from Arabic Wikipedia for the question, thus

limiting our reader's search span.

Each document is tokenized and stemmed using

Arabic NLTK, where stop words are removed, with

the vector of each document then normalized.

Subsequently, the TF-IDF uses the term vector

model to represent documents and questions as

vector weights of identifiers, and articles and

questions are compared with the TF-IDF weighted

bag-of-word vectors. Afterward, each document is

scored using cosine similarity, which measures the

similarity between two non-zero vectors (Question

and paper).

Finally, the document retriever returns the top

five relevant Arabic Wikipedia articles with the

highest similarity to any question. These articles are

then passed to the document reader for processing

and answer extraction.

4.1.2 AraBERT Document Reader

Our proposed reader module is AraBERTv0.1, one

of two versions of the AraBERT model, attaining an

F1 score of 63.41 and 50.81% of exact match

accuracy on a TyDi dataset. We selected it for our

proposed system based on its high accuracy and

acceptable evaluation time that gives the extracted

answer 30 seconds to 1 minute compared to the

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

368

Volume 21, 2024

Arabic-BERT large, which takes 5 to 7 minutes to

answer the question.

The input text was initially tokenized using

SentencePiece (an unsupervised text tokenizer and

detokenizer) consisting of 64k tokens, which are

then embedded. Each question and paragraph pair

input point is represented as a single sentence

separated by a special pass. For each permit I in the

section, we compute the probability that i is the

answer's start or end. Afterward, we predict the span

(i, j), where i is the token with the highest

probability of being the answer's start token. In

contrast, j is the token with the highest likelihood of

being the answer's end token.

4.1.3 Answer Ranking

Alongside the top retrieved documents from step 1,

we obtained a score from the TF-IDF retriever per

document, considering cosine similarities between

the paper and question. We got a score per candidate

answer per retrieved document that passed as an

input to the document reader.

Following normalization, we passed them

through the softmax function to ensure that the

answer and document scores were on the same

scale, providing us with outputs and a vector

representing the probability distributions for a list of

outcomes.

4.2 AraQA-BERT User-Interface Design

The designed web-based tool interfaces for our

AraQA-BERT open domain q QA system are

presented in Figure 3 and Figure 4.

Fig. 3: Overview of our AraQA-BERT Home page

Fig. 4: Overview of AraQA-BERT Features page

4.3 Evaluation and Result

Our primary focus was previously on evaluating

pre-trained models AraBERT and Arabic-BERT.

Due to the time constraint, we evaluated our system

by conducting simple and not extended user

evaluation sessions.

The evaluation was conducted on five

participants who tried the web system tool. The

description of the evaluation method used is as

follows:

a) Participants characteristics: users from

different educational environments, and their ages

ranged from 20 to 35 years. Most of the participants

have experience in using website tools. They are

selected to test the AraQA-BERT system to ensure

that it achieves the objectives of the system.

b) Tasks: after explaining the general idea of

the system, tasks were given to the participants as a

list of steps to be carried out one by one as follows:

1) Enter five different questions in the Question

textbox '.ﻝﺍﺅﺱﻝﺍ'

2) Click on the button 'ﻝﺍﺱﺭﺇ '

3) Analyze whether the extracted answer is true

or not.

c) Testing Sessions: The experiment was

conducted at the exact location. We set the "AraQA-

BERT" system for each user, read the tasks for

them, and offer assistance to the participants if they

counter a problem. Finally, the observers

interviewed the participants after the session to ask

them, "How many correct answers did they get

during the experiment?" and "How did they find use

of the system in general". The time set for

experiencing the "AraQA-BERT" system is 10

minutes unless the participants need more time or

ask for expansion. In general, the system test took

one hour to complete.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

369

Volume 21, 2024

The result of our post-session interview shows

that out of 25 questions tried by users, seven

questions correctly retrieved their answers. During

the testing sessions, we observed that the system

could correctly retrieve related paragraphs

containing answers to the question. Still, the issue is

not scoring the correct paragraph highly enough.

5 Discussion

The results reported in all of the above tables show

significant improvement in the results for all of the

modules, over the baseline TF_IDF Reader. Arabic-

BERT Large was the most accurate model in all

cases. Table 4 evidenced that the TyDi dataset

attained a 71.6 F1 and 57.8 EM, which were the

most accurate results the model attained across all

experiments. However, its shortcoming is that the

evaluation process was time-consuming.

Meanwhile, the Arabic-BERT Mini was the less

accurate model in all instances, albeit with less

evaluation time. As presented in Table 7, the ARCD

dataset attained a 23.5 F1 and 4.7 EM, thus being

the less accurate results across all experiments.

Furthermore, we observed that AraBERTv0.1

necessitates no pre-segmentation, achieving

advanced accuracy results with 63.7 F1 and 50.8

using the TyDi dataset over AraBERTv1 in all

cases, during close evaluation periods.

This boost in performance for the pre-trained

models has numerous explanations. For example,

the data size of the Arabic-BERT Large is a clear

factor in enhancing performance, because training

with a higher data size provides more accurate

results. Contrastingly, smaller data size offers less

accurate results, as apparent with the Arabic-BERT

mini. With the large data size, the pre-training

distribution is characterized by greater diversity.

Concerning the pre-segmentation applied before

BERT, tokenization reduced the QA task

performance efficacy, as with AraBERTv1,

although with no pre-segmentation applied more

accurate results were found, as with AraBERTv0.1.

We believe that these factors facilitated the

proposed models reaching the state-of-the-art level

for the QA task.

6 Conclusion

In conclusion, this work focused on developing an

Arabic QA model using pre-trained AraBERT and

Arabic-BERT models. The experiments on multiple

datasets demonstrated significant improvements

over the TF-IDF reader baseline, with Arabic-BERT

Large achieving the highest accuracy.

AraBERTv0.1, which does not require pre-

segmentation, showed better results than

AraBERTv1.

Furthermore, an open domain QA system was

built using Arabic Wikipedia as a knowledge

source, consisting of three main modules. The

system achieved an F1 score of 63.41 and 50.81%

exact match accuracy on the TyDi dataset, with

acceptable evaluation times. A user evaluation

session with 5 participants and 25 questions showed

that seven were correctly answered.

Future work will expand the experiments by

using different datasets and exploring different

hyperparameters to improve the performance of the

pre-trained models. Additionally, efforts will be

made to enhance the system's paragraph selection

and answer extraction capabilities.

References:

[1] Jones, Gareth JF, Seamus Lawless, Julio

Gonzalo, Liadh Kelly, Lorraine Goeuriot,

Thomas Mandl, Linda Capellato, and Nicola

Ferro. "Experimental IR meets

multilinguality, multimodality, and

interaction. In Proceedings of the Eighth

International Conference of the CLEF

Association, Dublin, Ireland, September 11–

14, 2017. Lecture Notes in Computer Science

(LNCS) (Vol. 10456)." (2023, May).

https://doi.org/10.1007/978-3-319-65813-1.

[2] Malinowski, Mateusz, Marcus Rohrbach, and

Mario Fritz, “Ask Your Neurons: A Deep

Learning Approach to Visual Question

Answering.” International Journal of

Computer Vision, 125 (1–3), 2017: 110–35.

https://doi.org/10.1007/s11263-017-1038-2.

[3] Zhou, Qingyu, Nan Yang, Furu Wei, Chuanqi

Tan, Hangbo Bao, and Ming Zhou. "Neural

Question Generation from Text: A

Preliminary Study." Lecture Notes in

Computer Science (Including Subseries

Lecture Notes in Artificial Intelligence and

Lecture Notes in Bioinformatics), 2018 10619

LNAI: 662–71, (2023, April).

https://doi.org/10.1007/978-3-319-73618-

1_56.

[4] Li, Ying, Jie Cao, and Yongbin Wang.

"Implementation of intelligent question

answering system based on basketball

knowledge graph." In 2019 IEEE 4th

Advanced Information Technology, Electronic

and Automation Control Conference (IAEAC),

Chengdu, China, pp. 2601-2604. IEEE, 2019,

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

370

Volume 21, 2024

(2023, May).

https://doi.org/10.1109/iaeac47372.2019.8997

747.

[5] Kenton, Jacob Devlin Ming-Wei Chang, and

Lee Kristina Toutanova. "Bert: Pre-training of

deep bidirectional transformers for language

understanding." In Proceedings of naacL-

HLT, vol. 1, p. 2. (2019, Minnesota), (2023,

April). http://arxiv.org/abs/1810.04805.

[6] Wang, Zhiguo, Patrick Ng, Xiaofei Ma,

Ramesh Nallapati, and Bing Xiang, "Multi-

Passage BERT: A Globally Normalized

BERT Model for Open-Domain Question

Answering." In Proceedings of the 2019

Conference on Empirical Methods in Natural

Language Processing and the 9th

International Joint Conference on Natural

Language Processing (EMNLP-IJCNLP),

5878–5882. Hong Kong, China: Association

for Computational Linguistics, 2019, (2023,

May). https://doi.org/10.18653/v1/D19-1599.

[7] Annamoradnejad, Issa, Mohammadamin

Fazli, and Jafar Habibi. "Predicting subjective

features from questions on QA websites using

BERT." In 2020 6th international conference

on web research (ICWR), Tehran, Iran, pp.

240-244. IEEE, 2020, (2023, May).

http://arxiv.org/abs/2002.10107.

[8] Hashemi Hosseinabad, Sayedshayan, Mehran

Safayani, and Abdolreza Mirzaei. "Multiple

answers to a question: a new approach for

visual question answering." The Visual

Computer, 37, no. 1 (2021): 119-131.

[9] Diefenbach, Dennis, Kamal Singh, and Pierre

Maret. "WDAqua-Core0: A Question

Answering Component for the Research

Community." Communications in Computer

and Information Science, 769: 84–89, 2017,

https://doi.org/10.1007/978-3-319-69146-6_8.

[10] Sadek, Jawad, and Farid Mezian. "A

Discourse-Based Approach for Arabic

Question Answering." ACM Transactions on

Asian and Low-Resource Language

Information Processing, 16 (2): 1–18, 2016.

https://doi.org/10.1145/2988238.

[11] Azmi, Aqil M., and Nouf A. Alshenaif,

"Lemaza: An Arabic Why-Question

Answering System." Natural Language

Engineering, 23 (6): 877–903, 2017.

https://doi.org/10.1017/S1351324917000304.

[12] Omri, Hajer, Zeineb Neji, Marieme Ellouze,

and Lamia Hadrich Belguith. 2017. "The Role

of Temporal Inferences in Understanding

Arabic Text." Procedia Computer Science,

112: 195–204.

https://doi.org/10.1016/j.procs.2017.08.228.

[13] Lhioui, Chahira, Anis Zouaghi, and Mounir

Zrigui. "A Rule-Based Approach for Arabic

Temporal Expression Extraction."

Proceedings - 2017 International Conference

on Engineering and MIS, ICEMIS 2017,

Janua: 1–6, 2018, Monastir, Tunisia, (2023,

May).

https://doi.org/10.1109/ICEMIS.2017.827311

[14] Bessaies, Essia, Slim Mesfar, and Henda Ben

Ghezala. "Generating Answering Patterns

from Factoid Arabic Questions." In

Proceedings of the Linguistic Resources for

Automatic Natural Language Generation-

LiRA@ NLG, pp. 17-24. 2017, (2023, June). .

https://doi.org/10.18653/v1/w17-3803.

[15] Al-Smadi, Mohammad, Islam Al-Dalabih,

Yaser Jararweh, and Patrick Juola..

“Leveraging Linked Open Data to

Automatically Answer Arabic Questions.”

IEEE Access, 7 (March), 2019: 177122–36.

https://doi.org/10.1109/ACCESS.2019.29562

33.

[16] Al-Kabi, Mohammed, Izzat Alsmadi, Rawan

T. Khasawneh, and Heider Wahsheh.

"Evaluating social context in arabic opinion

mining." Int. Arab J. Inf. Technol., 15, no. 6

(2018): 974-982.

[17] Hammoud, Jaafar, Natalia Dobrenko, and

Natalia Gusarova. "Named entity recognition

and information extraction for Arabic medical

text." In Multi Conference on Computer

Science and Information Systems, MCCSIS,

pp. 121-127. 2020, (2023, July), [Online].

https://www.iadisportal.org/digital-

library/named-entity-recognition-and-

information-extraction-for-arabic-medical-text

(Accessed Date: July 1, 2024).

[18] Hammo, Bassam, and Steven Lytinen..

"QARAB: A Question Answering System to

Support the Arabic Language." ACL2002:

Computational Approaches to Semitic

Languages, 11, 2002, (2023, April),

https://doi.org/10.3115/1118637.1118644.

[19] Al-Ghamdi, S., Al-Khalifa, H. and Al-

Salman, A., Fine-Tuning BERT-Based Pre-

Trained Models for Arabic Dependency

Parsing. Applied Sciences, 13(2023, July),

p.4225. https://doi.org/10.3390/app13074225.

[20] Albarghothi, Ali, Feras Khater, and Khaled

Shaalan.. "Arabic Question Answering Using

Ontology." Procedia Computer Science, 117,

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

371

Volume 21, 2024

2017: 183–91.

https://doi.org/10.1016/j.procs.2017.10.108.

[21] Torki, Marwan, Maram Hasanain, and Tamer

Elsayed. "QU-BIGIR at SemEval 2017 Task

3: Using Similarity Features for Arabic

Community Question Answering Forums,

Proceedings of the 11th International

Workshop on Semantic Evaluation (SemEval-

2017)", 360–64, (2023, July).

https://doi.org/10.18653/v1/s17-2059.

[22] Safaya, Ali, Moutasem Abdullatif, and Deniz

Yuret.. "KUISAIL at SemEval-2020 Task 12:

BERT-CNN for Offensive Speech

Identification in Social Media."

ArXiv:2007.13184 [Cs], July, 2020, (2023,

May). http://arxiv.org/abs/2007.13184.

[23] Romeo, Salvatore, Giovanni Da San Martino,

Yonatan Belinkov, Alberto Barrón-Cedeño,

Mohamed Eldesouki, Kareem Darwish,

Hamdy Mubarak, James Glass, and

Alessandro Moschitti. “Language Processing

and Learning Models for Community

Question Answering in Arabic.” Information

Processing and Management, 56 (2), 2019:

274–90.

https://doi.org/10.1016/j.ipm.2017.07.003.

[24] Aouichat, Asma, Mohamed Seghir Hadj

Ameur, and Ahmed Geussoum.. “Arabic

Question Classification Using Support Vector

Machines and Convolutional Neural

Networks.” In Natural Language Processing

and Information Systems, edited by Max

Silberztein, Faten Atigui, Elena Kornyshova,

Elisabeth Métais, and Farid Meziane,

10859:113–25. Lecture Notes in Computer

Science. Cham: Springer International

Publishing, 2018, (2023, October).

https://doi.org/10.1007/978-3-319-91947-

8_12.

[25] Hamza, Alami, Noureddine En-Nahnahi,

Khalid Alaoui Zidani, and Said El Alaoui

Ouatik.. "An Arabic Question Classification

Method Based on New Taxonomy and

Continuous Distributed Representation of

Words." Journal of King Saud University -

Computer and Information Sciences, 1–7,

2019.

https://doi.org/10.1016/j.jksuci.2019.01.001.

[26] Tang, Haoyang, Meng Qian, Ziwei Sun, and

Cong Song.. Visual Question Answer System

Based on Bidirectional Recurrent Networks.

Advances in Intelligent Systems and

Computing. Vol. 891. Springer International

Publishing, 2019. https://doi.org/10.1007/978-

3-030-03766-6_67.

[27] Mozannar, Hussein, Elie Maamary, Karl El

Hajal, and Hazem Hajj, "Neural Arabic

Question Answering." In Proceedings of the

Fourth Arabic Natural Language Processing

Workshop, 108–118. Florence, Italy:

Association for Computational Linguistics,

2019, (2023, July).

https://doi.org/10.18653/v1/W19-4612.

[28] Nozza, Debora, Federico Bianchi, and Dirk

Hovy. "What the [mask]? making sense of

language-specific BERT models." arXiv

preprint arXiv:2003.02912, March, 2020,

(2023, December).

http://arxiv.org/abs/2003.02912.

[29] Imane Lahbari, Hamza Alami, and Khalid

Alaoui Zidani. “Towards a Passages

Extraction Method”, Springer International

Publishing. 2020, Morocco, 2019, (2023,

Novmber), https://doi.org/10.1007/978-3-030-

36653-7.

[30] Ismail, Walaa Saber, and Masun Nabhan

Homsi.. "DAWQAS: A Dataset for Arabic

Why Question Answering System." Procedia

Computer Science, 142, 2018: 123–31.

https://doi.org/10.1016/j.procs.2018.10.467.

[31] Clark, Jonathan H., Eunsol Choi, Michael

Collins, Dan Garrette, Tom Kwiatkowski,

Vitaly Nikolaev, and Jennimaria Palomaki..

"TyDi QA: A Benchmark for Information-

Seeking Question Answering in Typologically

Diverse Languages." ArXiv:2003.05002 [Cs],

March, 2020, (2023, November).

http://arxiv.org/abs/2003.05002.

[32] Benajiba, Yassine, Paolo Rosso, and José

Manuel Gómez Soriano.. "Adapting the JIRS

Passage Retrieval System to the Arabic

Language." In Computational Linguistics and

Intelligent Text Processing, edited by

Alexander Gelbukh, 530–41. Lecture Notes in

Computer Science. Berlin, Heidelberg:

Springe, 2007r, (2023, September).

https://doi.org/10.1007/978-3-540-70939-

8_47.

[33] Trigui, Omar, Lamia Hadrich Belguith, and

Paolo Rosso. "DefArabicQA: Arabic

definition question answering system." In

Workshop on Language Resources and

Human Language Technologies for Semitic

Languages, 7th LREC, Valletta, Malta, pp.

40-45. 2010, (2023, September). DOI:

10.1109/NLPKE.2009.5313730.

[34] Abouenour, Lahsen, Karim Bouzouba, and

Paolo Rosso, "An Evaluated Semantic Query

Expansion and Structure-Based Approach for

Enhancing Arabic Question/Answering."

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

372

Volume 21, 2024

International Journal on Information and

Communication Technologies, 3 (3): 37–51,

2010, [Online].

https://www.researchgate.net/publication/280

253859_An_evaluated_semantic_QE_and_str

ucture-

based_approach_for_enhancing_Arabic_QA

(Accessed Date: July 1, 2024).

[35] Anselmo Caroline Sporleder, and Eduard H.

Hovy Pamela Forner lvaro Rodrigo Richard

FE Sutcliffe Corina Forascu Peas. “Neural

Arabic Question Answering” In CLEF

(Notebook Papers/Labs/Workshop), (2023,

October), 2011. DOI:10.18653/v1/W19-4612.

[36] Akour, Mohammed, Sameer O. Abufardeh,

Kenneth Magel, and Qasem Al-Radaideh..

"QArabPro: A Rule Based Question

Answering System for Reading

Comprehension Tests in Arabic." American

Journal of Applied Sciences, 8 (6): 652–61,

2011.

https://doi.org/10.3844/ajassp.2011.652.661.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

The authors equally contributed in the present

research, at all stages from the formulation of the

problem to the final findings and solution.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding was received for conducting this study.

Conflict of Interest

The authors have no conflicts of interest to declare.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en

_US

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS

DOI: 10.37394/23209.2024.21.34

Afnan H. Alshehri

E-ISSN: 2224-3402

373

Volume 21, 2024