Experimental Speech Recognition from Pathological Voices
HADJI SALAH
Laboratory of Nanomaterials (LANSER), Energy Center (CRTEn)
Technopole of Borj-Cédria, Hammam-Lif, Tunis, 2050
TUNISIA
Abstract: Speech recognition has been the subject of a great deal of research, since it is the natural means of dynamic and efficient human interaction, relying simultaneously on the two phenomena of phonation and hearing between speakers. Its applications are numerous, for example: dictation, speech synthesis within Windows software, the speech recognition of the Google search engine on smartphones, etc. All these applications depend on the conditions of use in which they are deployed. To overcome the difficulties caused by these imperfections, the speech signal must be properly characterized by extracting its most relevant features, such as the fundamental frequency (pitch), timbre and tonality. Many extraction techniques are available; the most widely used are acoustic analyses such as MFCC, PLP, LPC and RASTA, as well as their combinations (hybridizations) such as PLP-RASTA and MFCC-PLP. These techniques are used in data transmission, speaker recognition and even speech synthesis.
Keywords: speech signal, parameterization, SVM, pathological voices, classifier, MFCC, PLP, RASTA, LPC
Received: April 18, 2021. Revised: August 17, 2022. Accepted: September 23, 2022. Published: October 31, 2022.
1. Introduction
The extraction of acoustic parameters or features, such as the fundamental frequency, the formants, etc., is done by applying signal processing methods such as time-frequency analysis, spectral analysis and cepstral analysis. Parameterization constitutes the initial block (Fig. 2) of any speech recognition system; its role is to extract from the speech signal the most relevant information possible in order to separate the sounds [8]. The extracted information is presented as a sequence of acoustic vectors. Several extraction methods exist; taking into account the superposition of noise on the sounds, we compare the different methods (MFCC, PLP, PLP-RASTA, and the combination of several other parameters such as LPC, pitch, formants and energy). Given the redundancy and complexity of the speech signal, different methods are admitted in order to obtain a better parameterization. In this paper we first give a brief overview of signal processing tools such as short-term energy and weighting windows, then describe the main speech signal parameterization methods: LPC (Linear Predictive Coding) analysis, homomorphic or cepstral analysis, on which the MFCC (Mel-Frequency Cepstral Coefficients) are based, PLP (Perceptual Linear Prediction) and PLP-RASTA (RelAtive SpecTrAl). An SVM classifier is then used to distinguish speech signals from people with a vocal pathology (nodule or oedema) from normal signals (no pathology). In this paper, two types of classification have been used:
- A two-class classification, in which corpus samples from pathological signals (nodule and oedema) form one class and samples from normal signals form the other; the figure below shows the principle of this classification.
- A multi-class classification, in which samples of each of the two pathologies we have (nodule and oedema) constitute the first and second classes, and samples from normal signals constitute the third class. To perform this multi-class classification, we used a one-vs-all type of algorithm, that is to say "one against all": this algorithm takes a signal and compares it against all the classes, as in the sketch below.
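As an illustration, here is a minimal sketch of this one-vs-all strategy with an SVM, using scikit-learn. The feature matrix, labels and class codes (0 = normal, 1 = nodule, 2 = oedema) are hypothetical placeholders, not the actual corpus used in this work.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import train_test_split

    X = np.random.randn(200, 13)           # placeholder acoustic vectors (e.g. 13 MFCCs)
    y = np.random.randint(0, 3, size=200)  # placeholder labels: 0=normal, 1=nodule, 2=oedema

    # 3/4 of the corpus for learning, 1/4 for testing, as in Section 3.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # One binary SVM per class: each one separates its class from all the others.
    clf = OneVsRestClassifier(SVC(kernel="rbf"))
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))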
2. Parameterization methods
There are several methods of parameterization: some, such as MFCC and PLP, are based on the perception of the human ear, while others, such as the cepstral method and LPC, are based on the speech production model.
Fig.2. Classification algorithm
2.1 Cepstral Frequency Coefficients on the Mel scale (MFCC)
The extraction of cepstral coefficients on the Mel scale was developed in 1980 by Davis and Mermelstein. A Hamming window is applied to each frame of the signal, giving one cepstral feature vector per frame; the Discrete Fourier Transform (DFT) is then applied, the logarithm of the amplitude spectrum is kept, and, after smoothing the spectrum, the Discrete Cosine Transform is applied to obtain the cepstral coefficients (see Fig. 1).
Fig. 1. Mel coefficient calculation process
The extraction of the MFCC coefficients consists of six steps, as shown in the previous figure (Fig. 1) [6]:
Step 1: Pre-emphasis:
This step emphasizes the high frequencies, which increases the energy at the higher frequencies:
Y[n] = X[n] − a·X[n−1] (1)
Step 2: Segmentation into frames: this step consists in fragmenting the signal into frames of 20 to 40 ms. Each frame contains N samples, and adjacent frames are shifted by M samples (M < N); the typical values used are M = 100 and N = 256.
Step 3: Windowing with Hamming:
The discontinuities introduced by segmentation can be attenuated by multiplying each frame by a Hamming window W(n), defined for 0 ≤ n ≤ N−1, where:
N: number of samples in each frame
X[n]: input signal
Y[n]: output signal
W(n): the Hamming window
The result is then:
󰇛󰇜󰇛󰇜󰇛󰇜 (2)
W (n) =0.54-0.46 cos (2πn/ (N-1)) (3)
0≤n≤N-1
Step 4: The fast (short-term) Fourier transform:
To go from the time domain to the spectral domain, a Fourier transform is applied to each frame of N samples. With * denoting convolution, the FFT gives:
Y(ω) = FFT[h(t) * x(t)] = H(ω) × X(ω) (4)
Step 5: Mel filter bank: the power spectrum of each frame is weighted by a bank of triangular filters spaced on the Mel scale.
Step 6: Application of the iDCT (Inverse Discrete Cosine Transform) to the logarithm of the filter-bank outputs, which yields the cepstral coefficients.
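The six steps above can be condensed into a short NumPy/SciPy sketch. This is an illustrative implementation under assumed defaults (26 triangular filters, 12 retained coefficients, pre-emphasis a = 0.97), not the exact code used in this work.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(x, fs, N=256, M=100, n_filters=26, n_ceps=12, a=0.97):
        x = np.asarray(x, dtype=float)
        # Step 1: pre-emphasis, Y[n] = X[n] - a*X[n-1], Eq. (1)
        x = np.append(x[0], x[1:] - a * x[:-1])
        # Step 2: N-sample frames shifted by M samples (N=256, M=100)
        n_frames = 1 + (len(x) - N) // M
        frames = np.stack([x[i * M:i * M + N] for i in range(n_frames)])
        # Step 3: Hamming window, Eq. (3)
        frames = frames * np.hamming(N)
        # Step 4: power spectrum via the FFT
        power = np.abs(np.fft.rfft(frames, N)) ** 2 / N
        # Step 5: triangular Mel filter bank
        mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((N + 1) * mel_to_hz(mels) / fs).astype(int)
        fbank = np.zeros((n_filters, N // 2 + 1))
        for j in range(1, n_filters + 1):
            fbank[j - 1, bins[j - 1]:bins[j]] = np.linspace(0, 1, bins[j] - bins[j - 1], endpoint=False)
            fbank[j - 1, bins[j]:bins[j + 1]] = np.linspace(1, 0, bins[j + 1] - bins[j], endpoint=False)
        log_mel = np.log(power @ fbank.T + 1e-10)
        # Step 6: cosine transform of the log energies; keep the first 12 coefficients
        return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]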
2.2 LPC (Linear Predictive Coding)
LPC analysis is based on the speech production model shown in the figure below [7]. Speech is modeled as a linear process: the sample at instant n is predicted as a linear combination of the p previous samples. However, since speech is not perfectly linear, an error term e(n) is introduced to account for this deviation [2].
Fig.3. Speech production model
LPC consists in calculating the coefficients a_k that minimize this error. The model is:
x(n) = Σ_{k=1}^{p} a_k x(n−k) + e(n) (5)
The predictor's equation is:
x̂(n) = Σ_{k=1}^{p} a_k x(n−k) (6)
The prediction error is therefore:
e(n) = x(n) − x̂(n) = x(n) − Σ_{k=1}^{p} a_k x(n−k) (7)
The problem is then to determine the p "optimal" coefficients a_k, knowing N samples of a signal x[n], such that the error e(n) is as small as possible. To do this, we minimize the energy of the prediction error e(n) over the block of length N, i.e. we minimize:
E = Σ_n e²[n] = Σ_n (x[n] − Σ_{k=1}^{p} a_k x[n−k])² (8)
This is achieved by setting ∂E/∂a_k = 0 for each a_k, which generates a system of p equations in p unknowns (the a_k) that can then be solved for the coefficients. The system of equations that allows us to calculate the coefficients a_k is:
Fig.4. Yule-Walker matrix
where:
R(i) = Σ_{n=i}^{N−1} x(n) x(n−i) (9)
The transfer function of the resulting all-pole filter is given by:
H(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k}) (10)
2.3 The PLP technique
PLP (Perceptual Linear Prediction) is a parameterization technique based on the human auditory system. It improves on LPC, which estimates the spectrum over the entire audible band and can therefore miss certain spectral details. PLP estimates the parameters of an all-pole autoregressive filter, allowing a better modeling of the auditory spectrum: critical bands are introduced at the level of the power spectrum using a bank of 17 filters whose center frequencies are linearly spaced on the Bark scale, which simulates the perception of the human ear over the audible range of approximately 20 Hz to 22 kHz and is much closer to perception than the linear Hertz scale (1 Bark ≈ 100 Mels) [3, 4].
Fig.4. PLP coefficient computation (input: denoised speech)
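For illustration, here is a small sketch of the Bark warping and of the 17 filter center frequencies. The warping formula (Hermansky's approximation used in PLP) and the assumed 16 kHz sampling rate are our own choices; the paper does not specify them.

    import numpy as np

    def hz_to_bark(f):
        # Bark warping used in PLP (Hermansky, 1990); other approximations exist.
        f = np.asarray(f, dtype=float)
        return 6.0 * np.log(f / 600.0 + np.sqrt((f / 600.0) ** 2 + 1.0))

    def bark_to_hz(z):
        # Inverse warping: z = 6*asinh(f/600)  =>  f = 600*sinh(z/6)
        return 600.0 * np.sinh(np.asarray(z, dtype=float) / 6.0)

    # 17 filters with center frequencies linearly spaced on the Bark scale,
    # here over an assumed 0..8 kHz band (half of a 16 kHz sampling rate).
    centers = bark_to_hz(np.linspace(hz_to_bark(0.0), hz_to_bark(8000.0), 17))
    print(np.round(centers))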
2.4 The PLP-RASTA technique
PLP-RASTA is a hybrid parameterization technique combining Perceptual Linear Prediction (PLP) and RelAtive SpecTrAl (RASTA) filtering. The RASTA technique identifies the zones of interest by comparing the temporal evolution of the spectral components with that of the vocal tract, and removes the components that do not correspond to speech (noise). Since the speech signal is often corrupted by slowly varying noise, RASTA uses a bank of filters that eliminates stationary components, thereby reducing the sensitivity of the speech analysis to slow spectral changes: a band-pass filter is applied to each spectral component in a critical-band frequency representation. Its transfer function is:
H(z) = 0.1 z^4 · (2 + z^{-1} − z^{-3} − 2z^{-4}) / (1 − 0.98 z^{-1}) (11)
This method gives good results against (convolutional) distortions, but its quality is lower for additive noise [8].
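A minimal sketch of this band-pass filtering, applied along time to each critical-band trajectory of a log spectrum. The coefficients are those of the classic RASTA filter, assuming the reconstruction of Eq. (11) above is the intended one.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_spectrum):
        """Apply the RASTA band-pass filter of Eq. (11) along time to each
        critical-band trajectory of a (n_frames x n_bands) log spectrum.
        The z^4 advance is ignored here, which only adds a 4-frame delay."""
        num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # numerator taps
        den = np.array([1.0, -0.98])                       # denominator taps
        return lfilter(num, den, log_spectrum, axis=0)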
3. Experimental Results
Two essential steps are needed to classify pathological and healthy voices. The first step is parameterization: matrices of relevant parameters, i.e. acoustic vectors, are extracted from a corpus of sounds from the TIMIT database and from recordings of other people with vocal pathologies. These vectors are fed to the input of the SVM classifier. The first stage is learning and the second stage is testing, which is why the validation base is divided into two sub-bases: one for learning (3/4) and one for testing (1/4). After a certain number of executions of the two stages, we can distinguish the voices of healthy people from the voices of people who have difficulty producing speech (a cold, for example). In the following, we present the different analyses:
- The LPC analysis represents the speech signal by its linear predictive coding (LPC) coefficients and is carried out in four steps:
Fig.5. LPC analysis of a few samples of a signal
We then present the wide-band and narrow-band spectrograms:
- The wide-band spectrogram is obtained with a short window (3 ms in our project); it makes it possible to follow the evolution of the formants, and the voiced periods appear in it as dark vertical bands.
- The narrow-band spectrogram is obtained with a larger window (30 ms); it makes it possible to visualize the harmonics of the signal in the voiced zones, which appear as horizontal bands. A minimal computation sketch is given after the figure.
Fig.6. Representation of broadband (right) and narrowband spectrograms
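Both views can be produced with scipy.signal.spectrogram; the 16 kHz sampling rate and the random placeholder signal below are assumptions for illustration only.

    import numpy as np
    from scipy.signal import spectrogram

    fs = 16000                   # assumed sampling rate
    x = np.random.randn(fs)      # placeholder 1 s speech signal

    # Wide-band: short 3 ms window -> good time resolution, formants visible.
    f_wb, t_wb, S_wb = spectrogram(x, fs, window="hamming", nperseg=int(0.003 * fs))
    # Narrow-band: 30 ms window -> good frequency resolution, harmonics visible.
    f_nb, t_nb, S_nb = spectrogram(x, fs, window="hamming", nperseg=int(0.030 * fs))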
Fig.7. Formants display
- PLP and PLP-RASTA technique
analysis
Fig.8. PLP and PLP-RASTA analysis
In the following, we present the parameterization matrices of the different techniques and establish a comparison.
- Parametric matrix 1:
This matrix is the first that we use in the classification, in order to compare the performances of the different parameterization methods. It contains 4 columns and 200 rows; the columns contain the parameter types and the rows contain the values:
- First column: samples of the signal.
- Second column: the short-term energy of the signal.
- Third column: the cepstral coefficients.
- Fourth column: the first 12 cepstral coefficients plus the pitch (F0) and the first three formants (F1, F2, F3).
The following figure shows this matrix:
Fig.9. Parameterization matrix 1
- Parametric matrices 2:
This part groups together the three parameterization methods most used in voice recognition, each represented by a matrix. The dimension of these matrices is 13 columns and 214 rows, i.e. 214 frames (vectors), each with 13 coefficients:
- First matrix: the PLP coefficients without RASTA filtering.
- Second matrix: the MFCC coefficients.
- Third matrix: the PLP-RASTA coefficients.
Fig. 9 presents these three matrices.
Classification of pathological vs. normal signals
As described in the introduction, an SVM classification is used to distinguish speech signals coming from people suffering from a vocal pathology (nodule or oedema) from normal signals (no pathology). Two types of classification were used in our application: a two-class classification (pathological vs. normal) and a multi-class, one-vs-all classification (nodule, oedema, normal).
1. Learning phase
The learning phase consists of creating a base model on which the subsequent classification of signals relies. This involves taking a speech signal, extracting its coefficients (MFCC, PLP or PLP-RASTA) and applying the function dedicated to learning.
Fig. 10. Learning phase
The red color corresponds to the parameters extracted from the signal of a healthy individual (0), and the green color corresponds to the pathological signals (1).
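A minimal sketch of this learning phase with a scikit-learn SVM; the coefficient matrix and labels are random placeholders standing in for the real acoustic vectors.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical training matrix: one row per acoustic vector (13 MFCC or
    # PLP coefficients), with label 0 = healthy and 1 = pathological.
    train_matrix = np.random.randn(160, 13)
    labels = np.random.randint(0, 2, size=160)

    model = SVC(kernel="rbf")   # the learned model used later in the test phase
    model.fit(train_matrix, labels)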
2. Test phase
The test phase consists of recovering the matrix resulting from learning in order to predict, i.e. generate a decision. The arguments of the SVM classifier are:
- the train matrix (learning matrix);
- Data N: a matrix of the same size as the matrix used when learning the model, containing the data to be classified.
The system then displays a message saying whether the voice comes from a person who is healthy in terms of voice production or who suffers from a pathology.
The following figure shows an overview of a
classified signal.
Fig.11. Example of a classified signal
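Continuing the learning sketch above, the test phase then reduces to a predict call on the "Data N" matrix; the majority vote over frames is our own illustrative decision rule, not necessarily the one used in the paper.

    # data_n plays the role of the "Data N" matrix described above: same
    # number of columns as the training matrix, one row per frame to classify.
    data_n = np.random.randn(54, 13)      # placeholder vectors to classify
    votes = model.predict(data_n)         # 0 = healthy, 1 = pathological per frame
    decision = "pathological voice" if votes.mean() > 0.5 else "healthy voice"
    print(decision)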
The following table shows the results obtained after applying the various parameterization methods; note that the signals used include both male and female voices.
Table 1. Number of signals for validation from TIMIT

Pathological signals (Nodule):  13
Pathological signals (Oedema):  12
Normal signals:                  9
Total signals:                  34
Training signals:                8
Table 2. Results of recognition

                              PLP       MFCC
Recognition rate (2 classes)  88.23%    76.4%
Multi-class recognition rate  85%       58%
The PLP-RASTA method is not presented in this table because it classifies all the signals as normal. This leads us to say that PLP-RASTA eliminates the noise that caused the pathological signals to be recognized as such, so that these signals become normal after RASTA filtering.
4. Conclusion
In this paper we have developed a Matlab application that performs parameterization in order to recognize pathological voices. This recognition is done using SVM classification with several types of acoustic vectors (PLP, MFCC and PLP-RASTA). According to the results obtained during the tests, the parameters generated by the PLP-RASTA method give a less satisfactory result than the other two methods.
References
[1] R. Boite, H. Bourlard, T. Dutoit, J. Hancq and H. Leich, Speech Processing (Traitement de la parole), Presses Polytechniques et Universitaires Romandes, 2000, ISBN 2-88074-388-5.
[2] http://fr.wikipedia.org/wiki/Fenêtrage
[3] Z. Hajaiej, K. Ouni and N. Ellouze, "Speech Parametrization Based on Cochlear Filter Modeling: Application to PAR", Signal Processing and Systems Laboratory (LSTS), ENIT.
[4] J. Pinquier, "Sound Indexing: Search for Primary Components for Audiovisual Structuring", PhD thesis in Computer Science, University of Toulouse III Paul Sabatier, December 20, 2004.
[5] H. Hosni, Z. Sakka, A. Kachouri and M. Samet, "Study of the RASTA-PLP Parameterization for Automatic Recognition of Arabic Speech" (Étude de la paramétrisation RASTA PLP en vue de la reconnaissance automatique de la parole arabe).
[6] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms Using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Vol. 2, Issue 3, March 2010.
[7] A. Cherif, "Audio Processing and Transmission Course: Digitization of the Audio Signal", Faculty of Sciences of Tunis El Manar.
[8] H. Hosni, Z. Sakka, A. Kachouri and M. Samet, "Study of the RASTA PLP Configuration for Automatic Recognition of Arabic Speech", LETI Laboratory, National School of Engineers of Sfax, 2009.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US