Arabic Dialects System using Hidden Markov Models (HMMs)
ZAKARIA SULIMAN ZUBI, EMAN JIBRIL IDRIS
Department of Computer Science, Faculty of Science, Sirte University, Sirte, LIBYA
Abstract: - The Arabic language has many different dialects and it must be recognized before using the
automatic speech recognition (ASR). On the other hand, it is observed in all Arab countries that the standard
Arabic language is widely written and used in an official speech, newspapers, public administration, and
schools but it is not used in daily conversations instead the dialect is widely spoken in daily life and rarely
written. In this paper, we examine the difficult task of properly identifying various Arabic dialects and propose
a system developed to identify a set of four regional and modern standard Arabic speeches, based on speech
recognition using Hidden Markov Models (HMMs) algorithms. HMMs have become a very popular way to
build a speech recognition system. It is set as hidden states and possibilities of transition from one state to
another. Due to the similarities and differences between the Arabic dialects, speeches collected from the ADI5
datasets were retrieved from the MGB-3 challenge source. We proposed an Arabic Dialect Identification
System called "Building a System for Arabic Dialects Identification based on Speech Recognition using
Hidden Markov Models (HMMs)" that takes Input as speech utterances and produces output as dialect being
spoken. During the training phase, speech utterances from one or more dialects were analyzed to capture the
important properties of audio signals in terms of time and frequency. During the testing phase, previously
unseen test utterances were utilized to the system, and the system outputs the dialect associated with the model
of dialect that most closely matches the test utterance. The proposed model of the system shows promising
results of the model for each dialect match.
Key-Words: - Arabic Dialect Identification (ADID), Hidden Markov Models (HMMs), Automatic Speech
Recognition (ASR).
Received: October 23, 2021. Revised: September 11, 2022. Accepted: October 13, 2022. Published: November 10, 2022.
1 Introduction
The principle of dialects in any language represents
a challenge to Machine Learning (at Automatic
Speech Recognition (ASR) systems) and many
important Natural Language Processing (NLP)
applications such as machine translation, social
media analysis, etc. Since a great deal of work on
the, automatic identification (AID) of languages
from the speech signal alone were accomplished
widely. Recently, dialect identification has begun to
receive attention from the speech science and
technology communities. Spoken dialect
identification (DID) is the process of identifying the
spoken dialect within speech. This task must be
performed without knowing any information about
spoken speech. The Arabic language has multiple
variants, including Modern Standard Arabic (MSA),
the formal written standard language of the media,
culture, and education, and the informal spoken
dialects that are the preferred method of
communication in daily life. While there are
commercially available Automatic Speech
Recognition (ASR) systems for recognizing MSA
with low error rates (typically trained on Broadcast
News), these recognizers fail when a native Arabic
speaker speaks in his/her regional dialect. Even in
news broadcasts, speakers often mix between MSA
and dialect, especially in conversational speech,
such as that found in interviews and talk shows.
Being able to identify dialect via MSA as well as to
identify which dialect is spoken during the
recognition process will enable ASR engines to
adapt their acoustic, pronunciation, morphological,
and language models appropriately and thus
improve recognition accuracy, [1].
The root of every current Automatic Speech
Recognition (ASR) system basically consists of a
set of statistical models that display the different
sounds of the language to be identified. Hidden
Markov models are one way to automatically
recognize spoken speech. Speech has a temporal
structure and can be disguised as a series of spectral
vectors that cover a wide range of sound
frequencies. Hence the Hidden Markov Model
(HMM) provides a natural framework for building
such models, [2].
In addition, the Hidden Markov Model (HMM) is
one of the most important machine learning models
used for the purpose of Automatic Speech
Recognition (ASR) systems for the task of dialect
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
304
Volume 21, 2022
identification. The Hidden Markov Model is the
basis for a set of successful acoustic modeling
techniques in speech recognition systems. The
reasons for this success are due to the analytical
ability of this model in the phenomenon of speech
and its accuracy in practical speech recognition
systems.
1.1 Identification of Arabic Dialects
Dialect Identification (DID) problem is a special
case of the more general problem of Language
Identification (LID). LID refers to the process of
automatically identifying the language class for a
given speech segment or text document, while DID
classifies between dialects within the same language
class, making it a more challenging task than LID.
A good DID system used as a front-end to an
automatic speech recognition system can help
improve the recognition performance by providing
dialectal data for acoustic and language model
adaptation to the specific dialect being spoken, [3].
The Applications of speech based DID can be
broadly categorized into two classes in relation to
their end users: human operators or machines. As
for human operators, speech based DID systems can
be used in routing calls, provisioning assistance and
more. On the other hand, for the machines,
numerous domains use DID such as: detection and
classification of spoken documents, document
retrieval, enhancing the performance of automatic
speech/speaker recognition, [4].
Most dialects identification systems operate in
two phases: training and recognition. During the
training phase, the typical system is presented with
examples of speech from a variety of dialects.
Fundamental characteristics of the training speech
then can be used during the second phase of dialect
identification: recognition. During recognition, a
new utterance is compared to each of the dialect
dependent models. Each dialect has characteristics
that are different from one dialect to another. We
need to examine the sentence as a whole to
determine the acoustic signature of the dialect, the
unique characteristics that make one dialect sound
distinct from another, [5]. In figure 1, we illustrate
the variations in dialects across the Arab world. The
figure shows that dialects are a continuum that often
transcends geographic regions and borders.
Acoustic features are obtained from the raw
speech signal and these are extracted without
knowledge of language. Dialect Identification DID
also has three major phases having feature
extraction, training and testing phase. The
methodology of feature vector extraction and kind
of feature vector affects the performance of DID as
these feature vectors are input for training and
testing phase.
In the training phase, generally reference models
are created such that one for each language using
statistical models like Gaussian Mixture Model
(GMM), Hidden Markov Model (HMM), Neural
Networks (NN) ..etc. These reference models are a
compact representation of a huge speech corpus of a
particular language. In testing phase, the input test
speech utterance is labelled with one of known
language (represented by reference models) based
on the decision criteria
Arabic dialects differ across several dimensions:
mainly according to geography and social class. As
for the geographical aspect of the language, the
Arabic dialects can be divided in many different
ways. The following is only one of many (and not
all members of any particular dialect group should
be considered completely linguistically
homogeneous):
In this study, we will test our approach on the
following four Arabic dialects with Modern
Standard Arabic (MSA).
- Gulf Arabic (GLF): includes the dialects of
Kuwait, Saudi Arabia, Bahrain, Qatar, United
Arab Emirates, and Oman.
- Levantine Arabic (LEV): includes the dialects
of Lebanon, Syria, Jordan, Palestine, and Israel.
- Egyptian Arabic (EGY): covers the dialects of
the Nile valley: Egypt and Sudan.
- North Africa (NOR): covers the dialects of
Morocco, Algeria, Tunisia, and Mauritania.
Fig. 1: Geographical distribution of Arabic dialects.
Source:https://en.wikipedia.org/wiki/Varieties_of_Arabi,
country codes and regions are added
1.2 Related Work
There are many related works that examine the
dialect and identify it in many ways. In reviewing
these works we focus on two main factors:
Type of recognition system used.
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
305
Volume 21, 2022
Results have been obtained.
The Spoken Arabic dialects identification: The
case of Egyptian and Jordanian dialects defined in
[M. Al-Ayyoub, 2014][6] designed an acoustic
model using fixed-size segmentation for which they
extracted the features using Wavelet transform with
a significant feature reduction. They deal with two
dialects Jordanian and Egyptian. They achieved a
97% precision barrier.
The Speech Recognition of Moroccan Dialect
Using Hidden Markov Models mentioned, in [B.
Mouaz, 2019][7], The purpose of this work are to
verify the ability of HMM Speech Recognition
System to distinguish the vocal print of speakers,
and identify them by giving each of them a specific
class. This is done through creating a speech
recognition system, and applying it to a Moroccan
Dialect speech. By investigating the extracted
features of the unknown speech and then comparing
them to the stored extracted feature vectors for each
different speaker, in order, to identify the unknown
speaker. The model utilized in this work was the
Hidden Markov Model. The MFCC + Delta + Delta-
Delta features performed best reaching an
identification score. The accuracy of HMMSRS is
about 90%.
Automatic Identification of Arabic Dialects were
showed in [M. Belgacem, 2010][8], A new model
has been presented in this work based upon the
features of Arabic dialects, nine dialects (Tunisia,
Morocco, Algeria, Egypt, Syria, Lebanon, Yemen,
Golf's Countries, and Iraq); namely, a model that
recognizes the similarities and differences between
each dialect. The model utilized in this work was the
Gaussian Mixture Models (GMM). Therefore, this
new initialization process is used and yields a better
system performance of 73.33%.
Swedish dialect classification using Artificial
Neural Networks and Gaussian Mixture Models,
which are indicated in [V. Blomqvist and D.
Lidberg, 2017][9], is a thesis which investigated the
classification of seven Swedish dialects based on the
SweDia2000 database. The classification was done
using Gaussian mixture models, which are a widely
used technique in speech processing. Inspired by
recent progress in deep learning techniques for
speech recognition, convolutional neural networks,
and multi-layered perceptron's were also
implemented. The Gaussian mixture models reached
the highest accuracy of 61.3% on a test set, based on
single-word classification. Performance is greatly
improved by including multiple words, achieving
around 80% classification accuracy using 12 words.
Multi-Dialect Arabic Broadcast Speech
Recognition was mentioned in [A. M. A. M. Ali,
2018] [10], is also a thesis which investigated Multi-
Dialect Arabic Automatic Speech Recognition
(ASR) with no prior knowledge about the spoken
dialect. In this study, they proposed Arabic as a
five-class dialect challenge consisting of the
previously mentioned four dialects as well as MSA.
They also investigated the different approaches for
ADI in a broadcast speech. They studied both
generative and discriminative classifiers, and
combined these features using a Multi-Class
Support Vector Machine (SVM), Deep Neural
Network (DNN), and Convolutional Neural
Network (CNN). They validated their results on an
Arabic/English language identification task, with an
accuracy of 100%. As well as they evaluated these
features in a binary classifier to discriminate
between MSA and DA, with an accuracy of 100%.
Arabic Speech Recognition System Based on
MFCC and HMMs illustrated in [H. A. Elharati, M.
Alshaari, 2020][5], the primary contribution of this
work was to design an Arabic ASR system and find
the performance of the selected Arabic words that is
successfully verified and examined. For this
purpose, 24 Arabic words were recorded from
native speakers, all the experiments were conducted,
and the recognition results of the ASR system were
investigated and evaluated. The system is designed
by MATLAB based on MFCC and discrete-
observation multivariate HMM. The best
recognition rate reaches 92.92% (51 total error
counts from 1368 total words count).
An Automatic identification of Arabic dialects
using Hidden Markov Models declared in [F. S.
Alorifi, 2008][11], used an ergodic HMM to model
phonetic differences between two Arabic dialects
(Gulf and Egyptian Arabic) employing standard
MFCC (Mel Frequency Cepstral Coefficients) and
delta features. The best parameter setting of this
system achieves high accuracy of 96.67% on these
two dialects.
Our proposed system addresses the problem of
Arabic dialect identification. On the other hand, we
used a dataset of audio examples for Four Arabic
dialects (Gulf, Levantine, North Africa and
Egyptian Arabic) with Modern Standard Arabic
(MSA). The proposed system will use the Hidden
Markov Model methods to build the dialects models
for the dialect identification task.
2 Motivations and the Statement of
the Problem
The motive, through which we chose the subject of
this work, is that there is only one standard Arabic
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
306
Volume 21, 2022
language, and all other Arabic languages are Arabic
dialects that are considered as derived from it. In
most cases all dialects express one thing in the
original Arabic language, but in different ways. This
motivated us to prompt and define the research
problem first in general in the following points: -
(1) Arabic Language research is growing
very slowly compared to English Language
research. Mainly the reason for this slow
growth is due to the lack of recent studies
on the phonetic nature of the Arabic
language and the difficulties in speech
recognition.
(2) Most Automatic Speech Recognition
(ASR) systems for Arabic are based on the
Modern Standard Arabic (MSA) Language,
and in fact, most people speak regional
dialects. Therefore, determining the Arabic
dialect from the input speech will help ASR
in the Arabic language for optimal
performance.
(3) The identification of dialects is still one
of the facing problems.
(4) Few recent studies examining Arabic
dialects for speech recognition purposes.
Secondly, in particular, we may define the
problem of this study as follows:
The issues of speech recognition systems in
identifying Arabic dialects can be represented by
finding the most suitable sequence of utterance
based on the segment of the Arabic dialect sound.
Suppose that O stands for an acoustic observation
sequence, obtained by the sequence of word for
each Arabic dialect [in our work the Dialects were
Egyptian (EGY), Levantine (LAV), Gulf (GLF),
North African (NOR), and Modern Standard Arabic
(MSA)].
On the other hand, using the hidden Markov
model for recognizing the Arabic dialect
identification problem based on speech recognition
aims to search for a sequence of words that is
translated into a sequence of the Hidden Markov
model . Thus, a model reference is created for each
dialect D to perform the comparison with unknown
words to determine the specific dialect.
The main problem is to determining the Arabic
dialect based on the speech recognition conditional
maximization holds in the equation 1 as following:
(1)
Where D* = D1, D2, …, Dn, and D is a number of
possible Dialects and O =o1, o2...oT, are denoted as a
sequence of acoustic observations. Therefore, by
applying Bayesian rule to find the
probability which was computed by using the
equation 2 as follows:
(2)
3 The Objectives of the Research
The main objective of this study is to build a system
that automatically identifies the Arabic dialects from
the input speech model (file) using HMM model.
This objective can be achieved in the following
means: (1) We will use the five classes Arabic
Dialect Identification ADI-5 dataset from
the MGB-3 challenge data source to
examine our proposed model.
(2) To segment and label Arabic corpora
that is suitable for implementing our aim.
(3) To analyze the extracted features using
the Mel-frequency Cepstral Coefficients
(MFCC) algorithm.
(4) To improve the accuracy of the Arabic
dialect identification system to classify and
identify Arabic dialects based on Automatic
Speech Recognition (ASR) using Hidden
Markov Model (HMM).
(5) Testing and evaluating our implemented
approach.
4 Methodologies
In this section, the methods and techniques that are
used to achieve the objectives of this paper are
presented in the following sections.
4.1 Automatic Speech Recognition (ASR) for
Dialect Identification (DID)
Automatic spoken language identification is defined
as the process that determines the identity of the
language spoken in a speech audio sample. The
importance of DID can be gauged from the growing
interest in automatic speech recognition. A good
language recognition system can facilitate labelling
the language of a speech segment for many tasks
like multilingual speech processing, such as spoken
language translation, spoken document retrieval,
metadata labelling and multilingual
speech recognition [12]. The same principle can be
applied on automatic spoken dialect identification
that can help reduce the ASR word error rate for
dialectal data by training ASR systems for each
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
307
Volume 21, 2022
dialect, or by adapting the ASR models to a specific
dialect. Automatic speech recognition (ASR) is a
process that converts an acoustic signal, captured by
the device microphone or over a telephone line, to a
set of textual words. Over the years, ASR systems
have been developed for many via-voice
applications. Examples include: speech to speech
translation, dictation, Computer aided language
learning, and voiced based information retrieval etc.
[10].
Accurate acoustic models (AM) are a significant
requirement of automatic speech recognizers.
Acoustic modeling of speech describes the relation
between the observed feature vector sequence,
derived from the sound wave, and the non-
observable sequence of phonetic units uttered by
speakers. The major concerns of the automatic
speech recognition are determining a set of
classification features and finding a suitable
recognition model for these features.
We demonstrated HMMs to be a special case of
regular Markov models. It became a more powerful
model for representing time varying signals as a
parametric random process. It works perfectly when
some input occurs as a new state will be generated.
The "hidden" model in Hidden Markov means that
changes from the old state to the new state are not
directly observable and that the transition
probability depends on how the model is trained
with the training sets,[13].
Based on that, to make the hidden Markov model
work, we need to train the model using training sets.
The training we used will hold all classes to be
graded since the model will only learn from what
has been trained. The Training sets will be trained
more likely than the test set since the more data we
get, the more information the model will be able to
learn.
Typically, modern ASR systems represent the
speech signal using state-of-the-art Mel Frequency
Cepstral Coefficients (MFCCs). The Hidden
Markov Models (HMMs) are then used to model the
MFCCs observation sequence. These features are
computed every 10 ms with an overlapping analysis
window of 25 ms.
Automatic speech recognition consists of a
numerous component Figure 2, below illustrates the
key tasks of ASR and their components, [14].
Fig. 2: The components and the key tasks of ASR
(1) Speech Signal Processing: In this process, the
speech signal is converted to a set of feature
vectors.
(2) Acoustic Models: The representation of
knowledge about acoustic, phonetics, and the
speaker variability are included in the models.
Hidden Markov Models are the foundation for
acoustic phonetics models. The acoustic
models are modified during training to ensure
that system performance is optimized.
(3) Language Models: The knowledge of the
system about what words are likely to appear
together, in what sequence, and what the
possible words are.
(4) The Recognition Algorithm
(Decoder): The most important component of
the ASR systems and it was represented as the
reason behind the ASR system. For each audio
frame, there is a process of pattern matching.
Hence, the decoder evaluates the received
feature against all other patterns. The best
match can be achieved when more frames are
processed or when the language model is
considered.
Acoustic features are obtained from the raw
speech signal and these are extracted without
knowledge of language. Dialect Identification DID
also has three major phases having feature
extraction, training and testing phase. The
methodology of feature vector extraction and kind
of feature vector affects the performance of DID as
these feature vectors are input for training and
testing phase.
In the training phase, generally reference models
are created such that one for each language using
statistical models like Gaussian Mixture Model
(GMM), Hidden Markov Model (HMM), Neural
Networks (NN)etc. These reference models are a
compact representation of a huge speech corpus of a
particular language. In the testing phase, the input
test speech utterance is labelled with one of known
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
308
Volume 21, 2022
languages (represented by reference models) based
on the decision criteria.
4.2 Hidden Markov Models (HMMs)
HMM is a probabilistic model for machine learning
and language processing. It is mostly used in speech
recognition, [20], [21], to some extent; it is also
applied for the classification task. HMM provides
solutions of three problems: evaluation , decoding,
and learning to find the most likelihood
classification. The core idea in using HMM for
speech recognition applications is to create a
stochastic model as shown in figure 3, from known
utterances and compare it with the unknown
utterances generated by the speaker. An HMM λ is
defined by a set of states N the individual states are
denoted by S = {S1; S2; …; SN}, and the state at
time t is qt. that have O observation symbols as
well as, three possibility metrics for each state
which are in (Equation 3), [15].
λ = (A, B, π) (3)
Where:
A: a set of state transition probabilities A =
aij
aij = P [qt+1 = Sj | qt = Si ] for 1 ≤ i ; j ≤ N.
B: A probability distribution in each of the
states B = bj(k) in which
bj(k) = P[Ok at t| qt = Sj ] where 1 ≤ k ≤ M 1 ≤ j ≤ N.
: The initial state distribution in which
= P [q1 = Si] where 1 ≤ i ≤N.
Fig. 3: The stochastic model using HMM for speech
recognition
HMMs are designed and analyzed with three
associated problems. These problems are:
(1) Evaluation problem: deals with evaluation of
probability/likelihood value of observation
sequence against given an HMM. in another
meaning, computing the likelihood P(O/λ), the
probability of model λ emitting observation
sequence O= O1; …; OT. With this problem,
testing is performed with Forward and Backward
algorithms, [16].
A. Forward Algorithm
Let αt(i) be the probability of the partial
observation circuit Ot = {o(1), o(2), ..., o(t)} to
produce all possible state sequences at the i-th
state.
αt(i) = P(o (1), o(2), ..., o(t)|q(t) = qi) (4)
The probability of the partial observation
sequence is the sum of αt(i) for all N states.
B. Backward Algorithm
In a similar manner, the backward variable βt(i)
is the probability in the partial observation
sequence of o(t+1) to the end that will be
generated by all state sequences, starting at state
i-th. Backward algorithm counts backward
variables back and forth along the observation
sequences.
βt(i) = P(o(t+1), o(t+2), ... , o(T) | q(t) = qi)
(5)
(2) Learning problem: for training purposes the
model is responsible for storing data collected
for a specific dialect class (i.e., in our work the
dialects were (EGY, GLF, LAV, MSA, and
NOR). We will adjust the model parameter by
λ = (A, B, π) to maximize P(O|λ) [16]? The most
difficult thing is to adjust the model parameters
(A,B,π) to maximize the probability of a given
sequence of observations with this problem, the
testing process is performed with the Baum-
Welch algorithm.
Let ξ t (i, j), the combined probabilities are in qi
state at time t and state qj at time t + 1, given the
model and sequence observed:
Where it will be obtained by;
The output sequence probability can be
expressed as follows;
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
309
Volume 21, 2022
The probability will be in state qj at time t;
(3) Decision problem: given observations O = O1,
O2, O3 …OT, and model λ = (A, B, π), is to
choose the corresponding state sequence Q = q1
q2 · · · qT which is optimal in some meaningful
sense [16]. To solve this problem we will use the
Viterbi algorithm to compare between the
training and the testing data and find otu the
optimal scoring path of state sequence by
selecting the high probabilities between the
model and the testing data.
The Viterbi algorithm chooses the best hidden
state sequence that maximizes the likelihood of
the state sequence for the given observation
sequence. Let δt(i) be the maximum probability
of the state sequence, the length t ends with state
i and yields the first observation for the given
model. δt(i)=max{P(q(1),q(2),...,q(t-1);
o(1),o(2),…,o(t)|q(t)=qi)}
(10)
With the advantages of HMM, we use HMM to
create a reference model for each dialect included in
our paper and design HMMs with different number
of states such as three or four or five.
5 The Proposed System Architecture
The aim of this proposed system is to assign an
audio signal to each appropriate Arabic dialect
entry. The main idea is to compare the phoneme
model representing the input audio signal with the
reference models for the different Arabic dialects.
According to the results of this comparison, we
assign the audio input signal to the class that
reduces cosine similarity. Figure 4 describes the
operation of the proposed system. The diagram
below is an abstract view using standard flowchart
notation to illustrate the processes and their links.
Fig. 4: Proposed System Flowchart
5.1 Dataset
In our proposed system, we will use the sample five
classes Arabic Dialect Identification ADI-5 dataset
obtained from Multi-Genre Broadcast (MGB)
competition, to implement the practical part in the
proposal work. This dataset will be used in training
and testing the system for each dialect.
The MGB challenge is a core evaluation of speech
recognition, speaker diarization, lightly supervised
alignment, and dialect identification using TV
recordings from the BBC and Aljazeera, as well as
YouTube videos, [17].
MGB-3 Challenge: The third edition of the MGB
challenge is the MGB-3 for ASRU-2017, [18].
MGB-3 focuses on dialectal Arabic (DA) using a
multi-genre collection of Egyptian YouTube videos.
Seven genres were used for the data collection. The
MGB-3 is using 16 hours of multi-genre data
collected from different YouTube channels, [19],
[20]. In 2017, the challenge featured two new
Arabic tracks based on TV data from Aljazeera as
well as YouTube recordings.
The dataset for the ADI supplied with more than 50
hours labeled for each dialect. This will be divided
across the five major Arabic dialects; Egyptian
(EGY), Levantine (LAV), Gulf (GLF), North
African (NOR), and Modern Standard Arabic
(MSA). Table 1 presents some statistics about the
training and test datasets [18].
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
310
Volume 21, 2022
Table 1. The number of hours and utterances of data
available for each dialect for training and testing.
5.2 Pre-processing
The functionality of the pre-processing stage is to
prepare the input signal to the feature extraction
stage. The main goal of this phase is to get the
speech signal of each word that has been spoken.
Thus, this phase handles any snags or loopholes that
might affect feature extraction. It basically tries to
remove noises and silence gaps. This is important
because noises and silence gaps in the inputs have
very inconsistent properties that can cause
misidentification as well as segmenting the speech
audio file by detecting endpoints. In the Pre-
processing stage, a speech waveform transforms
into a sequence of parameter vectors. This process
will be performed in both the training data and
testing data.
5.3 Feature Extraction
Feature extraction is a fundamental part of any
speech recognition or identification system
like language, dialect, speaker from speech
utterances from speech signal. The performance of
the proposed system depends on feature vectors, the
selection of feature vectors and some other
parameters of features are very important to get
good and significant results.
In case of dialect identification, the selection of
feature vectors must discriminate between the
content of speech i.e., phoneme, sequence of
phoneme and frequency of phoneme.
Features can be extracted from speech signals in
the frequency or time domains as it's indicated in
figure 5, which represents an acoustic features
vector. We will use in this phase the Mel Frequency
Cepstral Coefficients (MFCC) technique, which is a
popular speech feature representation. Mel
Frequency Cepstral Coefficients (MFCC) is an
important feature of a speech signal that reveals
phonemic differences between dialects. In our work,
we extract the MFCC from the short term duration
of the windowed speech signal. This process will be
performed on both the training data and testing data
Fig. 5: Frequency and Time Domains of Audio
Signal.
5.3.1 Mel Frequency Cepstral Coefficients
(MFCC)
MFCC are popular acoustic features and these have
significant results in speech processing tasks. These
features are mainly extracted from preprocessed
speech signals. The steps to extract MFCC from
speech signals are described in Figure 6a.
Fig. 6a: Extraction of MFCC features vectors.
The extraction of Mel Frequency Cepstral
Coefficients features vectors from speech signal
have several steps:
(1) The signal is smoothed by removing noise with
a digital filter (pre-emphasis filter) in order to
improve the system's efficiency performance.
Here is the pre-emphasis filter equation:
y(t) = x(t) αx (t − 1) (11)
y(t) is the result of pre-emphasis signal, x(t) is
the initial signal prior to pre-emphasis, the
constant value for the filter coefficient αx is
0.95, [16]
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
311
Volume 21, 2022
Fig. 6b: Result of Pre-emphasis.
(2) Frames and windows, instead of analyzing the
entire speech signal at once, are divided into
overlapping frames with short time duration,
and the frame size is generally 10ms-30ms. The
information at the beginning and end of the
frame is very important. To avoid losing the
information; we overlap frames to preserve the
information. Windowing technology is
implemented as well to avoid stopping in the
signal so that a windowing function is applied
to each frame length using a hamming window.
A mathematical equation hamming window is
represented as follows:
w[n] = 0.54 − 0.46 cos (2πn/N−1) (12)
(3) Apply a fast Fourier transform to get the scale
frequency of each frame of the windowed
speech signal.
Where, N is usually 256 or 512. As well as to
calculate the power spectrum (periodgram) will
obtained by using the following equation:
Where, xi is the ith frame of signal x.
(4) Using Mel scale filter bank, smooth the
spectrum of the speech signal that gets the
spectrum data values from more significant
parts. The Mel frequency scale is a linear
frequency below 1000 Hz and a logarithmic
space above 1000 Hz. The bank filter will be
applied in the frequency domain shown in
figure 7. The converting method between Hertz
(f) and Mel (m) will be achieved by using the
following equation: The formula for converting
from frequency to Mel scale; Mel scale is
defined as equation (1).
Where, f is a frequency in Hz.
Fig. 7: Apply FFT and Mel scale filter bank.
(5) The logarithm is applied to the Mel spectrum
which converts the Mel cepstrum to a time
domain, i.e. Mel cepstrum, and then a discrete
cosine transform (DCT) is applied to the Mel
cepstrum to obtain the coefficient. Discrete
Cosine Transform (DCT) is the last stage in
forming Mel Frequency Cepstrum. The
equation used to calculate DCT is indicated as
following:
(6) Reduce the interrelations between compressed
information and coefficients to coefficients of
lower order.
5.4 Training Phase
During training, spectral features vectors are
extracted from the digitization of training speech
utterances. Given O acoustic features vectors for
each Dialect. Then, by using the forward-backward
algorithm HMMs are designed as one reference
model for each dialect to capture the characteristics
of each Dialect spoken within the speech data by
initializing randomly the parameters (initial
probabilities, transition probabilities, and output
densities) of an HMM for each Dialect D, the result
is A model set, λD where D is a number of possible
Dialects, then using a forward algorithm to calculate
log-likelihood P(O/λD) for each λD.
5.5 Language Model (LM)
The Dialect speech is recognized by classification
phase by using extracted features and a dialect
template where the dialect template contains syntax
and semantics related to the responsible dialect
(13)
(14)
(15)
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
312
Volume 21, 2022
which help the classifiers to identify the input
utterance. The language model is an N-gram model
trained separately in which the probability of each
word is conditionally obtained on its N-1
predecessors.
5.6 Testing Phase
During testing, we will study the results of the
acoustic features vector from the feature extraction
phase. That will be extracted from the test sampled
data. Using the Viterbi (Decoding) algorithm of the
feature vector sequences which will be performed
against each of the HMMs, producing a likelihood
score that the given test utterance was produced by
the models. The final step is to select the most
likely model according to:
The Dialect of the model most likely to have
produced the test utterance observations is
hypothesized as the Dialect of the test utterance one
of (EGY, GLF, LAV, and NOR) Dialect.
6 Experiments and Results
In this work, the proposed system will be
implemented to identify the Arabic audio, which
could classify and recognize the dialect of input
speech audio. This section will present the results of
the identification process, as well as the design and
implementation of the proposed identification
system. It focuses only on five major dialects such
as: NOR, EGY, GLF, and LAV with MSA.
In this work also we build and develop an
Automatic Speech Recognition (ASR) system using
Hidden Markov Model (HMM). The proposed
system was implemented and satisfactory
performance was developed using the MATLAB
platform in term to make the system more
interactive and faster.
Our system follows a standard recipe. corpus-
suggested timings are used to segment the audio
data, 13 Mel Frequency Cepstral Coefficients
(MFCCs) as well as their 1st and 2nd derivatives
were extracted with 10 ms using a 25 ms framed
speech signal, and each conversation side was
normalized using mean cepstral and variance
normalization. All the models were trained with
context-dependence triphones by using the
maximum likelihood function. Whereas, each phone
was modelled using left-to-right HMM and the
outcomes are represented in three states.
In this work we investigate the similarity between
each pair of dialects of Arabic through acoustic
models (HMMs), that refer to different type's
dialects of Arabic. The performance of this purpose
will be illustrated by using cosine similarity. Table
2, presents the results of the classification of
dialectical speech and describes the confusion.
The test set of the experiment was defined by 200
utterances of 10-sec and 200 utterances of 45-sec
from each target dialect. The score of each token
sequence was obtained by summing all the log
bigram probabilities given each bigram language
model. For dialect identification purpose, a
maximum likelihood classifier was finally used to
hypothesize the language being spoken in each
utterance.
For classifying the dialect in the testing speech,
the forward score of the speech utterance must be
computed. The five different scores from the
different dialect models were processed by the
maximum likelihood classifier and the one with the
highest log likelihood was taken to be the
hypothesized dialect.
Table 2. The matrix of confusion given by the
acoustic model.
In table 2, we found some confusion between
Arabic dialects. It is clearly shown that the highest
confusion rates are those between GLF and LAV
dialects. This confusion is justified by the closeness
between these pairs of dialects; e.g., GLF and LAV
dialects share significant vocabulary. Figure 8,
illustrated the performance details of our proposed
system for each Arabic dialect.
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
313
Volume 21, 2022
Fig. 8: Classification rate for each Arabic dialect
7 Evaluations
Since the dialect identification task is a standard
statistical classification problem, Throughout the
experiments, the performance metric for ADI test
datasets includes Accuracy (ACC), Recall (RCL)
(false negative value), and Precision (PRC) (positive
predictive value).
In this study, we evaluate the testing results by
organizing a report which holds a given dialect d,
which will be calculated by the dialect identification
measurements defined as:
Where Scorrect is the amount of correctly
identified test sequences of all test sequences Sd
voiced in dialect d.
In order to ensure the reliability of our results,
we use a k-fold cross-validation technique with k =
10.
Figure 9, shows the Accuracy, Precision, and
Recall of the system.
Fig. 9: System Performances
8 Conclusions
We conclude in this paper an automatic identifying
Arabic dialects system which had being proposed.
The proposed system called " Building a System for
Arabic Dialects Identification based on Speech
Recognition using Hidden Markov Models (HMMs)
("ADIDSHMM") ", In the Identification Arabic
Dialects system the difficult task of properly is to
identify a various Arabic dialect and examining it.
We applied the classification technique algorithm
called Hidden Markov Models (HMM) to learn the
results of the speech recognition based on the
acoustic features vector from the feature extraction
phase. The classification process in terms of HMM
algorithm via using identification of the dialect of
wav audio is one of five dialects. The feature
extraction process was implemented in which
speech features are extracted for all the speech
samples. All these features are given to the pattern
trainer for training and are trained by HMM to
create HMM models for each dialect. Afterward we
will use the Viterbi algorithm of HMM to select the
one with the maximum likelihood in which it
recognized the dialect.
The dataset we used is widely known as ADI5.
The ADI5 dataset became our experimental dataset
that's created and collected by the MGB-3 challenge
includes a multi-dialectal speech from various
programs recorded from the Al-Jazeera TV channel.
It also includes audio files in Egyptian (EGY),
Levantine (LEV), Gulf (GLF), North African
(NOR), and Modern Standard Arabic (MSA).
Finally, in our experimental results, we
illustrated the overall system performance via four
indices: overall accuracy, average precision and
average recall for the five dialects.
(17)
(19)
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
314
Volume 21, 2022
References:
[1] F. Biadsy, J. Hirschberg, and N. Habash, "Spoken
Arabic dialect identification using phonotactic
modeling," in Proceedings of the eacl 2009
workshop on computational approaches to semitic
languages, 2009, pp. 53-61.
[2] A. M. J. E. J. o. E. Deshmukh and T. Research,
"Comparison of hidden markov model and recurrent
neural network in automatic speech recognition,"
vol. 5, no. 8, pp. 958-965, 2020.
[3] F. Biadsy, "Automatic dialect and accent
recognition and its application to speech
recognition," Columbia University, 2011.
[4] H. C. S. Bougrine and A. Abdelali, "Spoken arabic
algerian dialect identification," in 2018 2nd
International Conference on Natural Language and
Speech Processing (ICNLSP), 2018, pp. 1-6: IEEE.
[5] H. A. Elharati, M. Alshaari, V. Z. J. J. o. C.
Këpuska, and Communications, "Arabic Speech
Recognition System Based on MFCC and HMMs,"
vol. 8, no. 03, p. 28, 2020.
[6] M. Al-Ayyoub, M. K. Rihani, N. I. Dalgamoni, and
N. A. Abdulla, "Spoken Arabic dialects
identification: The case of Egyptian and Jordanian
dialects," in 2014 5th International Conference on
Information and Communication Systems (ICICS),
2014, pp. 1-6: IEEE.
[7] B. Mouaz, B. H. Abderrahim, and E. J. P. C. S.
Abdelmajid, "Speech Recognition of Moroccan
Dialect Using Hidden Markov Models," vol. 151,
pp. 985-991, 2019.
[8] M. Belgacem, G. Antoniadis, and L. Besacier,
"Automatic Identification of Arabic Dialects," in
LREC, 2010.
[9] V. Blomqvist and D. Lidberg, "Swedish Dialect
Classification using Artificial Neural Networks and
Guassian Mixture Models," 2017.
[10] A. M. A. M. Ali, "Multi-dialect Arabic broadcast
speech recognition," 2018.
[11] F. S. Alorifi, "Automatic identification of arabic
dialects using hidden markov models," University of
Pittsburgh, 2008.
[12] P. Heracleous, A. Yoneyama, K. Takai, and K.
Yasuda, "Automatic Spoken Language
Identification Using Emotional Speech," in
International Conference on Human-Computer
Interaction, 2020, pp. 650-654: Springer.
[13] N. Thiracitta, H. Gunawan, and G. Witjaksono,
"The comparison of some hidden markov models
for sign language recognition," in 2018 Indonesian
Association for Pattern Recognition International
Conference (INAPR), 2018, pp. 6-10: IEEE.
[14] P. A. Torres-Carrasquillo, T. P. Gleason, and D. A.
Reynolds, "Dialect identification using Gaussian
mixture models," in ODYSSEY04-The Speaker and
Language Recognition Workshop, 2004.
[15] M. J. A. t. D. o. I. I. o. A. I. S. L. Heck, "Automatic
Language Identification for Natural Speech
Processing Systems," 2011.
[16] H. Z. Muhammad, M. Nasrun, C. Setianingsih, and
M. A. Murti, "Speech recognition for English to
Indonesian translator using hidden Markov model,"
in 2018 International Conference on Signals and
Systems (ICSigSys), 2018, pp. 255-260: IEEE.
[17] P. Bell et al., "The MGB challenge: Evaluating
multi-genre broadcast media recognition," in 2015
IEEE Workshop on Automatic Speech Recognition
and Understanding (ASRU), 2015, pp. 687-693:
IEEE.
[18] A. Ali, S. Vogel, and S. Renals, "Speech recognition
challenge in the wild: Arabic MGB-3," in 2017
IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), 2017, pp. 316-
322: IEEE.
[19] Arabicspeech.org. 2022. MGB3_ADI
ArabicSpeech. [online] Available at:
<https://arabicspeech.org/mgb3-adi/> [Accessed 1
March 2022].
[20] Mgb-challenge.org. 2022. MGB Challenge - MGB-
3. [online] Available at: <http://www.mgb-
challenge.org/MGB-3.html> [Accessed 6 March
2022].
[21] En-Naimani, Z. A. K. A. R. I. A. E., M. O. H. A. M.
E. D. Lazaar, and M. O. H. A. M. E. D. Ettaouil.
"Hybrid system of optimal self organizing maps and
hidden Markov model for Arabic digits
recognition." WSEAS Transactions on
Systems 13.60 (2014): 606-616.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
-Zakaria Suliman Zubi, carried out the optimization
as well as the statistics of the article.
-Eman Jibril Idris, carried out the idea and
implemented the algorithms with statistical used of
Hidden Markov Model (HMM) in the ASR system
as well as the code.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
The research work was supported by Department of
Computer Science, Faculty of Science, Sirte
University, Sirte, Libya.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US
WSEAS TRANSACTIONS on COMPUTERS
DOI: 10.37394/23205.2022.21.37
Zakaria Suliman Zubi, Eman Jibril Idris
E-ISSN: 2224-2872
315
Volume 21, 2022