Experimental Speech Recognition from Pathological Voices
HADJI SALAH
Laboratory of Nanomaterials (LANSER), Research and Technology Centre of Energy (CRTEn)
Technopole of Borj-Cédria, Hammam-Lif 2050, Tunisia
Abstract: Speech recognition has been the subject of extensive research, as it is a natural means of dynamic and efficient human communication, relying simultaneously on the two phenomena of phonation and hearing. Its applications are numerous, for example dictation, speech synthesis in Windows software, and the speech recognition of the Google search engine on smartphones. All of these applications depend on the conditions in which they are used; to overcome their imperfections, the speech signal must be properly characterized by extracting its most relevant features, such as the fundamental frequency (pitch), timbre and tonality. Many extraction techniques are available, the most widely used being acoustic analyses such as MFCC, PLP, LPC and RASTA, as well as their combinations (hybridizations), such as RASTA-PLP and MFCC-PLP. These techniques are used in data transmission, in speaker recognition and even in speech synthesis.
Keywords: speech signal, parameterization, SVM, pathological voices, classifier, MFCC, PLP, RASTA, LPC
Received: March 15, 2022. Revised: October 11, 2022. Accepted: November 15, 2022. Published: December 31, 2022.
1. Introduction
The extraction of acoustic parameters or characteristics, such as the fundamental frequency and the formants, is done by applying signal processing methods such as time-frequency analysis, spectral analysis and cepstral analysis. Parameterization constitutes the initial block (Fig. 2) of any speech recognition system; its role is to extract from a speech signal the most relevant information possible in order to separate the sounds [8].
The extracted information is presented as a sequence of acoustic vectors. Several methods exist to extract these parameters; taking into account the noise superimposed on the sounds, we compare the different methods (MFCC, PLP, PLP-RASTA, and combinations with several other parameters such as LPC, pitch, formants and energy). Given the redundancy and complexity of the speech signal, different methods are admitted in order to obtain a better parameterization. In this paper we first give a brief overview of signal processing tools such as short-term energy and weighting windows, then present the different speech signal parameterization methods: LPC (Linear Predictive Coding) analysis, homomorphic or cepstral analysis, on which the MFCC (Mel-Frequency Cepstral Coefficients) are based, PLP (Perceptual Linear Prediction) and PLP-RASTA (RelAtive SpecTrAl). An SVM classification is then used to distinguish between speech signals from people with a vocal pathology (nodule or oedema) and normal signals (no pathology). Two types of classification have been used:
- A two-class classification: we used corpus samples from pathological signals (nodule and oedema) and others from normal signals; Fig. 2 shows the principle of this classification.
- A multi-class classification: we took samples of each of the two pathologies (nodule and oedema) to constitute the first and second classes, and samples from normal signals for the third class. To perform the multi-class classification, we used a one-vs-all ("one against all") algorithm, which consists of taking a signal and comparing it with all the classes.
2. Parameterization methods
There are several parameterization methods: some are based on the perception of the human ear, such as MFCC and PLP, while others rely on the speech production model, such as the cepstral method and LPC.
Fig. 2. Classification algorithm
2.1 Cepstral Frequency Coefficients on the Mel Scale (MFCC)
The computation of cepstral coefficients on the Mel scale was developed in 1980 by Davis and Mermelstein. A Hamming window is first applied to each frame of the signal; the Discrete Fourier Transform (DFT) is then applied, the logarithm of the amplitude spectrum is kept, and after smoothing the spectrum the Discrete Cosine Transform is applied, yielding one cepstral characteristic vector per frame (see Fig. 1).
Fig. 1. Mel coefficient calculation process
The extraction of the MFCC coefficients consists of six steps, as shown in Fig. 1 [6]:
Step 1: Pre-emphasis:
This step emphasizes the high frequencies, which increases the energy at the higher frequencies:

Y[n] = X[n] − a·X[n−1] (1)

where a is the pre-emphasis coefficient (typically between 0.95 and 0.97).
Step 2: Segmentation into frames: this stage consists of fragmenting the signal into frames of 20 to 40 ms. The speech signal is split into frames of N samples, and adjacent frames are shifted by M samples (M < N); typically M = 100 and N = 256.
Step 3: Windowing with Hamming:
The discontinuities introduced by segmentation are attenuated by multiplying each frame by a Hamming window W(n), 0 ≤ n ≤ N−1, where N is the number of samples in each frame, X[n] is the input signal and Y[n] is the output signal. The result is:

Y[n] = X[n] · W(n) (2)

with the Hamming window given by:

W(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1 (3)
Step 4: The fast (short-term) Fourier transform:
To go from the time domain to the spectral domain, a Fourier transform is applied to each frame of N samples. A convolution in the time domain becomes a product in the spectral domain:

Y(ω) = FFT[h(t) ∗ x(t)] = H(ω) · X(ω) (4)
Step 5: Mel filter bank: the amplitude spectrum is smoothed by a bank of triangular filters spaced according to the Mel scale.
Step 6: Application of the iDCT (Inverse Discrete Cosine Transform) to the logarithm of the filter-bank outputs, which yields the cepstral coefficients.
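To make these six steps concrete, here is a minimal NumPy/SciPy sketch of the pipeline; the paper's own implementation was in Matlab, and the 26-filter bank, the 13 retained coefficients and a = 0.97 are common defaults assumed here, not values taken from the paper (N = 256 and M = 100 follow step 2):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs, N=256, M=100, n_filters=26, n_ceps=13, a=0.97):
    # Step 1: pre-emphasis, Y[n] = X[n] - a*X[n-1]  (equation 1)
    x = np.asarray(signal, dtype=float)
    x = np.append(x[0], x[1:] - a * x[:-1])
    # Step 2: frames of N samples, adjacent frames shifted by M samples
    n_frames = 1 + (len(x) - N) // M
    frames = np.stack([x[i * M : i * M + N] for i in range(n_frames)])
    # Step 3: Hamming window  (equations 2 and 3)
    frames = frames * np.hamming(N)
    # Step 4: magnitude spectrum via the FFT  (equation 4)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Step 5: triangular filters equally spaced on the Mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, N // 2 + 1))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(spectrum @ fbank.T + 1e-10)
    # Step 6: cosine transform of the log Mel spectrum -> cepstral coefficients
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```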
2.2 LPC (Linear Predictive Coding)
LPC analysis is based on the speech production model shown in the figure below [7]. Starting from the hypothesis that speech can be modeled as a linear process, each sample at instant n is predicted linearly from the p previous samples. However, the non-linearity of speech requires an error term, denoted e(n), introduced to correct this prediction [2].
Fig. 3. Speech production model
The LPC consists of calculating the coefficients a_k by minimizing the error. The model is:

x(n) = Σ_{k=1}^{p} a_k x(n−k) + e(n) (5)

The predictor's equation is:

x̂(n) = Σ_{k=1}^{p} a_k x(n−k) (6)

The prediction error is then:

e(n) = x(n) − x̂(n) = x(n) − Σ_{k=1}^{p} a_k x(n−k) (7)
The problem is to determine the p "optimal" coefficients a_k, knowing N samples of a signal x[n], such that the error e(n) is as small as possible. To do this, we minimize the energy of the prediction error e(n) over the duration of the block of length N, i.e. we minimize:

E = Σ_n e[n]² = Σ_n ( x[n] − Σ_{k=1}^{p} a_k x[n−k] )² (8)

We get there by setting ∂E/∂a_k = 0 for each a_k. This generates a system of p equations with p unknowns (the a_k), which can then be solved to obtain the a_k. This system, the Yule-Walker equations, is:

| R(0)    R(1)   ...  R(p−1) |   | a_1 |   | R(1) |
| R(1)    R(0)   ...  R(p−2) | · | a_2 | = | R(2) |
| ...     ...    ...  ...    |   | ... |   | ...  |
| R(p−1)  R(p−2) ...  R(0)   |   | a_p |   | R(p) |

where the autocorrelation is:

R(j) = Σ_{n=j}^{N−1} x(n) x(n−j) (9)
The transfer function of the filter is determined by the following equation:

H(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{-k}) (10)
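The Yule-Walker system above is classically solved with the Levinson-Durbin recursion, which exploits the Toeplitz structure of the autocorrelation matrix; the following NumPy sketch illustrates that standard method, not the paper's own code:

```python
import numpy as np

def lpc(frame, p=12):
    """Solve the Yule-Walker system by the Levinson-Durbin recursion;
    returns the coefficients a_k of equation 6 and the residual energy E."""
    x = np.asarray(frame, dtype=float)
    N = len(x)
    # autocorrelation R(j) = sum_n x[n] x[n-j]   (equation 9)
    R = np.array([np.dot(x[j:], x[:N - j]) for j in range(p + 1)])
    a = np.zeros(p)
    E = R[0]
    for i in range(p):
        # reflection coefficient for prediction order i+1
        k = (R[i + 1] - np.dot(a[:i], R[i:0:-1])) / E
        a[:i] -= k * a[i - 1::-1] if i > 0 else 0.0
        a[i] = k
        E *= 1.0 - k * k      # error energy decreases at each order
    return a, E               # H(z) = 1 / (1 - sum_k a_k z^-k), equation 10
```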
2.3 The PLP technique
PLP (Perceptual Linear Prediction) is a parameterization technique based on the human auditory system. It improves on LPC, which estimates the spectrum over the entire audible band and can therefore miss certain spectral details. PLP estimates the parameters of an all-pole autoregressive filter, allowing better modeling of the auditory spectrum: critical bands are introduced at the level of the power spectrum through a bank of 17 filters whose central frequencies are linearly spaced on the Bark scale, which simulates the perception of the human ear [3, 4]. The audible frequencies range approximately from 20 Hz to 22 kHz, and the Bark scale is much closer to perception than the linear Hertz scale (1 Bark ≈ 100 Mels) [4].
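As an illustration, the Hz-to-Bark warping commonly used in PLP (Schroeder's approximation) and 17 centre frequencies equally spaced on the Bark scale can be sketched as follows; the 16 kHz sampling rate is an assumption (TIMIT uses 16 kHz):

```python
import numpy as np

def hz_to_bark(f):
    # Schroeder's approximation: z = 6*ln(f/600 + sqrt((f/600)^2 + 1))
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def bark_to_hz(z):
    return 600.0 * np.sinh(np.asarray(z, dtype=float) / 6.0)

fs = 16000  # assumed sampling rate
# 17 centre frequencies equally spaced on the Bark scale up to the Nyquist band
centres = bark_to_hz(np.linspace(0.5, hz_to_bark(fs / 2) - 0.5, 17))
print(np.round(centres))  # spacing grows with frequency, like the ear's resolution
```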
Fig. 4. PLP coefficients
2.4 The PLP-RASTA technique
PLP-RASTA is a hybrid parameterization technique combining Perceptual Linear Prediction (PLP) and RelAtive SpecTrAl (RASTA) filtering. The RASTA technique identifies the zones of interest by comparing the temporal evolution of the spectral components with that expected from the vocal tract, and removes the components that do not correspond to speech (noise). Since the speech signal is often contaminated by slowly varying noise, RASTA uses a bank of filters that eliminates stationary components: a band-pass filter is applied to each spectral component in a critical-band frequency representation, which reduces the sensitivity of the speech analysis to slow channel variations. The transfer function of the filter is:

H(z) = 0.1 z^4 (2 + z^{-1} − z^{-3} − 2 z^{-4}) / (1 − 0.98 z^{-1}) (11)
This method is robust against convolutional (channel) distortions, but performs less well against additive noise [8].
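A minimal SciPy sketch of this filtering, applying the band-pass filter of equation (11) along the time axis of each critical-band log-energy trajectory; dropping the pure-advance factor z^4 is a standard implementation choice assumed here, not something stated in the paper:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_bands):
    """Band-pass filter each critical-band trajectory of the log spectrum.
    log_bands: array of shape (n_frames, n_bands); filtering runs along time.
    Dropping the z^4 advance of equation (11) delays the output by 4 frames."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # zeros: sum(b) = 0, so the
    a = np.array([1.0, -0.98])                       # DC (stationary) part is removed
    return lfilter(b, a, log_bands, axis=0)
```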
3. Experimental Results
Two essential steps are needed to classify pathological and healthy voices. The first step is parameterization: a matrix of relevant parameters (acoustic vectors) is extracted from a corpus of sounds from the TIMIT database and from recordings of people with vocal pathologies. These vectors are fed to the input of the SVM classifier. The first stage is learning and the second is testing, which is why the validation base is divided into two sub-bases, one for learning (3/4) and one for testing (1/4). After a certain number of executions of the two stages, we can distinguish the voices of healthy people from the voices of people who have difficulty producing speech (a cold, for example). In the following, we present the different analyses.
- The LPC analysis represents the speech signal by its linear predictive coding (LPC) coefficients and is carried out in four steps:
Fig. 5. LPC analysis of a few samples of a signal
We then present the broadband and narrowband spectrograms:
- The broadband spectrogram is obtained with a short window (3 ms in our project); it makes it possible to follow the evolution of the formants, and the voiced periods appear as dark vertical bands.
- The narrowband spectrogram is obtained with a larger window (30 ms); it makes it possible to visualize the harmonics of the signal in the voiced zones, which appear as horizontal bands.
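The two analyses differ only in window length; a minimal SciPy sketch, where the 16 kHz sampling rate and the placeholder signal are assumptions:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                  # assumed sampling rate (TIMIT uses 16 kHz)
x = np.random.randn(fs)     # placeholder for one second of speech samples

# broadband: short window (3 ms) -> fine time resolution, formants visible
n_broad = int(0.003 * fs)
f_b, t_b, S_broad = spectrogram(x, fs, window='hamming',
                                nperseg=n_broad, noverlap=n_broad // 2)

# narrowband: long window (30 ms) -> fine frequency resolution, harmonics visible
n_narrow = int(0.030 * fs)
f_n, t_n, S_narrow = spectrogram(x, fs, window='hamming',
                                 nperseg=n_narrow, noverlap=n_narrow // 2)
```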
Fig. 6. Representation of broadband (right) and narrowband spectrograms
Fig. 7. Formants display
- PLP and PLP-RASTA technique analysis
Fig. 8. PLP and PLP-RASTA analysis
In the following we present the parameterization matrices of the different techniques and establish a comparison.
- Parameterization matrix 1:
This matrix is the first that we will use in the classification, in order to compare the performance of the different parameterization methods.
It contains 4 columns and 200 rows; the columns hold the parameter types and the rows hold the values:
- First column: contains samples of the signal.
- Second column: contains the short-term energy of the signal (a sketch follows the figure below).
- Third column: contains the cepstral coefficients.
- Fourth column: contains the first 12 cepstral coefficients plus the pitch (F0) and the first three formants (F1, F2, F3).
The following figure shows this matrix:
Fig. 9. Parameterization matrix 1
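For illustration, the short-term energy of the second column can be computed per frame as the sum of squared windowed samples; a minimal sketch reusing the frame sizes N = 256 and M = 100 from the MFCC section (an assumption):

```python
import numpy as np

def short_term_energy(x, N=256, M=100):
    """Energy of each Hamming-windowed frame: E_i = sum_n (x_i[n] w[n])^2."""
    x = np.asarray(x, dtype=float)
    w = np.hamming(N)
    n_frames = 1 + (len(x) - N) // M
    return np.array([np.sum((x[i * M : i * M + N] * w) ** 2)
                     for i in range(n_frames)])
```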
- Parameterization matrices 2:
This part groups together the three parameterization methods most used in voice recognition, each represented by a matrix.
These matrices have 13 columns and 214 rows, i.e. 214 frames (acoustic vectors), each with 13 coefficients:
- First matrix: contains the PLP coefficients without RASTA filtering.
- Second matrix: contains the MFCC coefficients.
- Third matrix: contains the PLP-RASTA coefficients.
Fig. 10 presents these three matrices.
3.1 Classification of pathological signals vs. normal signals
An SVM classification is used to distinguish speech signals coming from people who suffer from a vocal pathology (nodule or oedema) from normal signals (no pathology). As described in the introduction, two types of classification were used: a two-class classification (pathological vs. normal samples) and a multi-class classification in which the nodule and oedema samples form the first two classes and the normal signals the third, handled with a one-vs-all ("one against all") algorithm that compares a signal against all the classes.
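A minimal scikit-learn sketch of this one-vs-all scheme on placeholder data; the paper's classifier was implemented in Matlab, so the library, features and labels here are illustrative assumptions:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# placeholder acoustic vectors (e.g. 13 coefficients per frame) and labels:
# 0 = normal, 1 = nodule, 2 = oedema; real features would come from the
# parameterization matrices above
X = np.random.randn(300, 13)
y = np.repeat([0, 1, 2], 100)

# one-vs-all: one binary SVM per class, each trained to separate its class
# from the two others; prediction picks the most confident of the three
clf = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)
print(clf.predict(X[:5]))
```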
1. Learning phase
The learning phase consists of creating a base model on which the subsequent classification of signals relies. A speech signal is taken, its coefficients (MFCC, PLP or PLP-RASTA) are extracted, and the training function of the SVM is applied.
Fig. 11. Learning phase
The red color corresponds to the parameters extracted from the signal of a healthy individual (label 0), and the green color to those of the pathological signals (label 1).
2. Test phase
The test phase consists of recovering the matrix resulting from learning in order to predict, i.e. to generate a decision. The arguments of the SVM classifier are:
- the train matrix (learning matrix);
- Data N: a matrix of the same size as the one used when learning the model, containing the data to be classified.
The system then displays a message saying whether the voice comes from a person who is healthy in terms of voice production or who suffers from a pathology.
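A minimal sketch of the two phases with the 3/4-1/4 split, on placeholder two-class data; an illustration under assumed names, not the paper's Matlab code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# placeholder two-class data: label 0 = healthy voice, 1 = pathological voice
X = np.random.randn(200, 13)
y = np.repeat([0, 1], 100)

# the validation base is split 3/4 for learning and 1/4 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = SVC(kernel='rbf').fit(X_train, y_train)   # learning phase
for label in model.predict(X_test[:3]):          # test phase: classify new data
    print('pathological voice' if label == 1 else 'healthy voice')
```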
The following figure shows an overview of a
classified signal.
Fig. 12. Example of a classified signal
The following table shows the results obtained after applying several parameterization methods; note that the signals used include both male and female voices.
Table 1. Number of signals for validation from TIMIT

Pathological signals          Total      Training
Nodule        Oedema          signals    signals
13            12              34         8
Table 2. Results of recognition

          Recognition rate     Multi-class
          (2 classes)          recognition rate
PLP       88.23%               85%
MFCC      76.4%                58%
The PLP-RASTA method is not shown in this table because it classified all the signals as normal. This leads us to say that RASTA filtering eliminates the noise-like components that caused the pathological signals to be identified as such; as a result, these signals appear normal after RASTA filtering.
4. Conclusion
In this paper we have developed a Matlab application that performs parameterization in order to recognize pathological voices.
This recognition is done using SVM classification with several types of acoustic vectors (PLP, MFCC and PLP-RASTA). According to the results obtained during the tests, the parameters generated by the PLP-RASTA method give a less satisfactory result than the other two methods.
References
[1] R. Boite, H. Bourlard, T. Dutoit, J. Hancq and H. Leich, Traitement de la parole (Speech Processing), Presses Polytechniques et Universitaires Romandes, 2000, ISBN 2-88074-388-5.
[2] http://fr.wikipedia.org/wiki/Fenêtrage
[3] Z. Hajaiej, K. Ouni and N. Ellouze, "Speech Parametrization Based on Cochlear Filter Modeling: Application to PAR", Signal Processing and Systems Laboratory (LSTS), ENIT.
[4] J. Pinquier, "Sound Indexing: Search for Primary Components for Audiovisual Structuring", PhD thesis in Computer Science, University of Toulouse III Paul Sabatier, December 20, 2004.
[5] H. Hosni, Z. Sakka, A. Kachouri and M. Samet, "Étude de la paramétrisation RASTA PLP en vue de la reconnaissance automatique de la parole arabe" (Study of RASTA-PLP parameterization for automatic recognition of Arabic speech).
[6] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Vol. 2, Issue 3, March 2010.
[7] A. Cherif, "Audio Processing and Transmission Course: Digitization of the Audio Signal", Faculty of Sciences of Tunis El Manar.
[8] H. Hosni, Z. Sakka, A. Kachouri and M. Samet, "Study of the RASTA-PLP Configuration for Automatic Recognition of Arabic Speech", LETI Laboratory, National School of Engineers of Sfax, 2009.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US