Unvoiced and Voiced Speech Segmentation Based on the Dimension of Signal Local Linear Manifold

ZHAOTING LIU, Electronic Information College, Qingdao University, Qingdao, CHINA
ZHONGXIAO LI, Electronic Information College, Qingdao University, Qingdao, 266071, CHINA
XIAODONG ZHUANG, Electronic Information College, Qingdao University, Qingdao, 266071, CHINA
NIKOS MASTORAKIS, Technical University of Sofia, Sofia, 1000, BULGARIA

Abstract—A new method of unvoiced and voiced speech segmentation is proposed from the perspective of local linear manifold analysis of the speech signal. It is based on the estimation of the dimension of the short-time linear subspace. The subspace dimensional characteristics of single-phoneme signals are studied. The local signal vector set is analyzed with the PCA algorithm to estimate the dimension of the data matrix formed by framing. Local PCA is applied to the speech signal to achieve the segmentation of unvoiced and voiced pronunciation. Simulation experiments prove the effectiveness of the proposed method.

Keywords—subspace dimension, local PCA, unvoiced and voiced speech segmentation

Received: April 19, 2021. Revised: February 20, 2022. Accepted: March 24, 2022. Published: April 28, 2022.

1. Introduction

Human pronunciation can be divided into voiced and unvoiced sound according to the vibration of the vocal cords. The vocal cords do not vibrate when air flows through the open vocal cords without obstruction; this produces unvoiced sound. Voiced sound is produced when the vocal cords close and the airflow makes them vibrate. If speech processing relies only on the overall synthesized features of the signal, it inevitably blurs the characteristics of the two components of speech (i.e., the unvoiced and the voiced).
Applying this classification to speech signal processing solves the problem that vowels and consonants have different time-frequency resolution requirements [1-3]. At the same time, the adaptability and performance of speech recognition can be enhanced. At present, there are several classical methods for classifying unvoiced and voiced sounds [4]. Eigenvalue thresholds can be set based on the difference in the short-term energy of the two sounds, and on this basis the judgment can be improved using short-term energy distribution characteristics [5], but this requires a large amount of computation and a complex implementation. Alternatively, the short-term zero-crossing rate can be used for the judgment [6]. Either way, the accuracy of these methods is unsatisfactory. Since the 1980s, artificial neural networks have also been introduced into this field, but their training is slow and easily falls into local optima [7-8].
In this paper, an unvoiced and voiced sound segmentation algorithm based on the estimation of the short-term linear subspace dimension of speech is designed. First, an overall principal component analysis of different monophones is carried out [9-10]. We find that the number of principal components (i.e., the dimension) of unvoiced and voiced sounds follows different trends as the frame length grows. On this basis, a local principal component analysis of continuous speech is then performed [11-12]. The change of the signal dimension over time reveals which time periods are voiced and which are unvoiced. The method exploits the difference in the number of principal components between unvoiced and voiced signals to obtain a decision criterion. It has good real-time performance and high accuracy.
2. PCA Algorithm

PCA transforms the original data into a representation whose dimensions are linearly independent, thereby achieving dimensionality reduction [13-15]. Simultaneously, it can be used to extract the
main feature components of the data. PCA transform, also
known as Hotelling transform or K-L transform, is an
orthogonal linear transform. The transform can be understood as a linear projection of the data into the subspace of smallest dimension, such that the resulting components are ordered by the amount of information they carry. The first principal component contains the largest amount of information, and each subsequent component contains less. Moreover, the principal components are uncorrelated after the transformation. After PCA, the information of the signal is concentrated mainly in the first few principal components. Generally, the components with a small amount of information are discarded, and only the leading components whose cumulative amount of information exceeds 90%~95% are retained. Each eigenvector of the data covariance matrix is a coordinate vector of the subspace, and its corresponding eigenvalue is the variance of the initial signal projected onto that axis.
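To illustrate that last property, here is a quick NumPy check of our own (not from the paper): the eigenvalues of the sample covariance matrix coincide with the variances of the data projected onto the corresponding eigenvectors.

```python
import numpy as np

# Illustrative check (not part of the paper's method): eigenvalues of the
# covariance matrix equal the variances of the data projected onto the
# corresponding eigenvectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                           [1.0, 1.0, 0.0],
                                           [0.0, 0.2, 0.1]])
Xc = X - X.mean(axis=0)                 # decentralize (Step 1 below)
C = (Xc.T @ Xc) / len(Xc)               # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
proj_var = np.var(Xc @ eigvecs, axis=0) # variance along each eigenvector
assert np.allclose(eigvals, proj_var)   # the two coincide
```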
The algorithm flow of PCA analysis of a speech signal is as follows:
First, divide the speech signal into $M$ frames, each with $N$ dimensions, and denote the $n$-th dimension element of the $m$-th frame as $X_m^n$.
Step 1: Decentralize all features, that is, find the mean of each dimension and then subtract that mean from each feature.
The $n$-th dimension mean:
$$\bar{X}^{n} = \frac{1}{M}\sum_{i=1}^{M} X_{i}^{n} \qquad (1)$$
The original signal becomes:
$$\tilde{X}_{m}^{n} = X_{m}^{n} - \bar{X}^{n} \qquad (2)$$
Step 2: Find the covariance matrix.
The covariance of two dimensions:
$$\operatorname{cov}(X^{n_1}, X^{n_2}) = \frac{1}{M}\sum_{i=1}^{M} \tilde{X}_{i}^{n_1}\,\tilde{X}_{i}^{n_2} \qquad (3)$$
The covariance matrix $C$:
$$C = \begin{pmatrix} \operatorname{cov}(X^{1},X^{1}) & \cdots & \operatorname{cov}(X^{1},X^{N}) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(X^{N},X^{1}) & \cdots & \operatorname{cov}(X^{N},X^{N}) \end{pmatrix} \qquad (4)$$
Step 3: Find the eigenvalues $\lambda$ of the covariance matrix $C$ and the corresponding eigenvectors $u$:
$$C u = \lambda u \qquad (5)$$
There are a total of $N$ eigenvalues; they are arranged from large to small, and the first $k$ are selected to get $u_1, \ldots, u_k$.
Step 4: Project the original features onto the selected eigenvectors, and use $y_m^k$ to denote the $k$-th dimension of the $m$-th frame:
$$\begin{pmatrix} y_{m}^{1} \\ y_{m}^{2} \\ \vdots \\ y_{m}^{k} \end{pmatrix} = \begin{pmatrix} u_{1}^{T} \\ u_{2}^{T} \\ \vdots \\ u_{k}^{T} \end{pmatrix} \left( \tilde{X}_{m}^{1}, \tilde{X}_{m}^{2}, \ldots, \tilde{X}_{m}^{N} \right)^{T} \qquad (6)$$
Step 5: Find the proportion of information in each dimension:
$$e_{n} = \frac{\lambda_{n}}{\sum_{j=1}^{N} \lambda_{j}} \qquad (7)$$
The value of an eigenvalue divided by the sum of all eigenvalues is the variance contribution rate of the eigenvector. It represents the proportion of the amount of information contained in this dimension.
Fig. 1 Flow chart of PCA algorithm
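To make the five steps concrete, the following is a minimal NumPy sketch of our own (not code from the paper; the function name count_principal_components and its arguments are our assumptions). It frames a signal, decentralizes it, forms the covariance matrix, and counts how many leading eigenvalues are needed to exceed a given information proportion. Step 4 (the projection itself) is omitted, since only the component count is needed for the segmentation that follows.

```python
import numpy as np

def count_principal_components(signal, frame_len, frame_offset, ratio=0.90):
    """Frame the signal, run PCA (Steps 1-5), and return the number of
    leading principal components whose cumulative variance contribution
    rate (Eq. 7) first reaches `ratio`."""
    # Frame the signal: M frames of N = frame_len dimensions each.
    starts = range(0, len(signal) - frame_len + 1, frame_offset)
    X = np.array([signal[s:s + frame_len] for s in starts])  # shape (M, N)
    # Step 1: decentralize each dimension (Eqs. 1-2).
    X = X - X.mean(axis=0)
    # Step 2: covariance matrix (Eqs. 3-4).
    C = (X.T @ X) / X.shape[0]
    # Step 3: eigenvalues, sorted from large to small (Eq. 5).
    eigvals = np.linalg.eigvalsh(C)[::-1]
    # Step 5: variance contribution rates (Eq. 7) and their cumulative sum.
    e = eigvals / eigvals.sum()
    return int(np.searchsorted(np.cumsum(e), ratio) + 1)
```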
3. Continuous Speech Segmentation Method Based on Local PCA

Local PCA means taking some frames that are close in time near a certain moment and performing PCA on them, instead of performing PCA on all frames of the whole signal. We apply local PCA analysis to continuous speech signals containing different pronunciation phonemes. The purpose is to check how the number of principal components of the local signal vector set changes over time; the unvoiced and voiced sounds can then be determined.
First, the frame offset is used as a variable. To increase the number of signal vectors, the frame offset between adjacent frames can be made smaller. Since PCA is a statistical method, it requires enough sample vectors: if the frame offset is too large, there will be too few signal vectors in the local temporal neighborhood, and the PCA results will lose statistical significance. However, a frame offset that is too small may also bring problems such as an increased amount of computation.
At the ordinary speech rate of continuous speech signals, a duration of about 16 ms to 32 ms corresponds to one pronunciation phoneme. Therefore, the local time interval is set to 20 to 30 ms. At a sampling frequency of 16 kHz, this converts to a local interval length of 320 to 480 sampling points. First, the frame length, frame offset, and local interval are set, and a local interval is taken from the beginning of the signal. Next, the local signal is framed to form a data matrix, and a complete PCA analysis is performed to obtain the number of principal components.
Repeat the above steps starting from the second frame, and the number of principal components as a function of time is obtained. The unvoiced and voiced sounds can then be effectively segmented by applying a suitable threshold to the result.
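A minimal sketch of this local procedure follows (again our own illustration, reusing count_principal_components from the sketch in Section 2; local_pca_curve and segment_unvoiced are hypothetical names, and the default parameters mirror the settings reported in Section 4.2).

```python
import numpy as np

def local_pca_curve(signal, frame_len=128, frame_offset=4, local_len=400):
    """Slide a local interval along the signal; at each position, frame the
    local samples and count the principal components (>90% of information)."""
    counts = []
    for start in range(0, len(signal) - local_len + 1, frame_offset):
        local = signal[start:start + local_len]
        counts.append(count_principal_components(local, frame_len, frame_offset))
    return np.array(counts)

def segment_unvoiced(curve, threshold=13):
    """Boolean mask: True where the local dimension exceeds the threshold,
    i.e., where the segment is judged unvoiced."""
    return curve > threshold
```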
4. Experimental Simulation Analysis

4.1 PCA Analysis of Monophone Signals

An overall PCA analysis of the monophone signals was carried out to observe how the number of principal components of each signal changes with the frame length, and thus to see the difference between different signals.
Fig. 2 PCA analysis of monophone signals
The results are shown in the figures below:
Fig. 3 The number of principal components varies with frame length (more than 90% composition)
Fig. 4 The number of principal components varies with frame length (more than 95% composition)
In Fig. 3 and Fig. 4, the abscissa represents the frame length,
and the ordinate represents the number of principal components.
The number of principal components of some phonemes varies with the frame length as shown in the tables below:
TABLE 1. THE NUMBER OF PRINCIPAL COMPONENTS VARIES WITH FRAME LENGTH (MORE THAN 90% COMPOSITION)

| Signal \ Frame length | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| /u/ | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| /o/ | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 9 | 9 |
| /sh/ | 6 | 11 | 15 | 19 | 23 | 28 | 31 | 35 | 39 | 43 | 46 | 50 | 52 | 55 | 58 | 61 |
| /s/ | 4 | 6 | 9 | 11 | 13 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 31 | 32 | 34 | 36 |
[Figs. 3 and 4 plot these curves for the phonemes /u/, /o/, /i/, /e/, /a/, /sh/, /s/, /h/, devoiced /e/, and devoiced /a/; abscissa: frame length (16 to 256 samples), ordinate: number of principal components.]
TABLE 2. THE NUMBER OF PRINCIPAL COMPONENTS VARIES WITH FRAME LENGTH (MORE THAN 95% COMPOSITION)

| Signal \ Frame length | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| /u/ | 2 | 4 | 5 | 6 | 7 | 8 | 8 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 10 |
| /o/ | 3 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | 11 | 12 | 12 | 12 | 12 | 13 | 13 | 13 |
| /sh/ | 8 | 14 | 19 | 24 | 29 | 35 | 40 | 44 | 49 | 54 | 58 | 63 | 66 | 70 | 74 | 77 |
| /s/ | 5 | 9 | 12 | 15 | 18 | 21 | 24 | 27 | 30 | 33 | 36 | 38 | 41 | 43 | 45 | 47 |
It can be seen from the figures and tables that for voiced sounds the number of principal components tends to a limit value, while for unvoiced sounds it increases approximately linearly with the frame length. Under the same frame length, the number of principal components differs between the pronunciation signals of different phonemes.
4.2 Local PCA Analysis of Continuous Speech Signals

Based on the above results, a method for segmenting the different phonemes of continuous speech signals based on local PCA analysis is proposed. That is, some temporally close frames near a certain moment are taken for PCA, and the number of principal components of the local signal vector set is checked as time passes. Thus, the voiced and unvoiced phonemes in word pronunciation can be judged and segmented.
Fig. 5 Flowchart of local PCA over time
The following figures show the local PCA analysis of the signals of the three words 'face', 'show', and 'wash'. The frame length is 128, the frame offset is 4, and the local range is 400 points. A certain threshold (boundary value) is applied to the result curve, and the outcome is compared with the time-domain waveforms.
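Under the assumptions of the earlier sketches, these experimental settings would correspond to a call like the following (the white-noise stand-in is used purely so the snippet runs; the paper's signals are actual word recordings sampled at 16 kHz).

```python
import numpy as np

# Stand-in for a real 16 kHz recording of 'show' (white noise here,
# only so the snippet is runnable).
signal = np.random.default_rng(1).normal(size=8000)

# Settings reported in Sec. 4.2: frame length 128, frame offset 4,
# local range 400 points; threshold 13 for the >90% criterion.
curve = local_pca_curve(signal, frame_len=128, frame_offset=4, local_len=400)
unvoiced_mask = segment_unvoiced(curve, threshold=13)  # True -> unvoiced
```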
Fig. 6 Local PCA segmentation results of 'show' word signal (more than 90% components)
Fig. 7 Local PCA segmentation results of 'show' word signal (more than 95% components)
Fig. 8 Local PCA segmentation results of 'face' word signal (more than 90% components)
Fig. 9 Local PCA segmentation results of 'face' word signal (more than 95% components)
Fig. 10 Local PCA segmentation results of 'wash' word signal (more than 90% components)
Fig. 11 Local PCA segmentation results of 'wash' word signal (more than 95% components)
The upper part of each figure is the time-domain waveform of the signal, and the lower part is the resulting graph of the number of principal components over time. The vertical direction of the two plots corresponds to the same time.

The thresholds on the number of principal components are taken as 13 (for more than 90% of components) and 18 (for more than 95% of components). Regarding the distinction between unvoiced and voiced sounds, it can be seen from the result graphs above that the positions of the red vertical lines correspond accurately to the unvoiced/voiced boundaries in the time-domain waveform; that is, the method can segment unvoiced and voiced sounds. However, the 'face' signal contains a silent segment. Although the voiced and unvoiced sounds can still be segmented, the presence of the silent segment cannot be distinguished. Therefore, the silent segment should be extracted first, and the remainder then segmented by the method in this paper.
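As a sketch of that preprocessing step (not part of the paper's method; a plain short-time energy gate is only one possible choice, and remove_silence is a hypothetical helper), silent frames could be dropped before running the local PCA segmentation.

```python
import numpy as np

def remove_silence(signal, frame_len=128, energy_ratio=0.01):
    """Drop frames whose short-time energy falls below a fraction of the
    maximum frame energy; a simple (assumed) silence gate."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    frames = [signal[s:s + frame_len] for s in starts]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    keep = energies > energy_ratio * energies.max()
    return np.concatenate([f for f, k in zip(frames, keep) if k])
```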
5. Conclusion

This paper first studies the relationship between the number of principal components and the frame length after a monophone signal is divided into frames and reduced in dimension. As the frame length increases, the number of principal components tends to a limit for voiced sounds, while for unvoiced sounds it increases approximately linearly. Under the same frame length, the number of principal components differs between the pronunciation signals of different phonemes. Further research on continuous speech segmentation by local PCA is then carried out: a set of speech frames that are very close in time is used for PCA analysis, and the resulting graph of the number of local principal components over time is compared with the time-domain waveform. It is found that voiced and unvoiced sounds can be effectively segmented by setting a threshold. Future research will address the joint segmentation of silent segments and unvoiced or voiced sounds, striving to achieve high-accuracy real-time segmentation that differs from traditional methods.
References

[1] Q. Huang, C. Bao, X. Wang, and Y. Xiang, "Speech enhancement method based on multi-band excitation model," Applied Acoustics, vol. 163, 2020.
[2] J. Yang, Z. Li, and P. Su, "Review of speech segmentation and endpoint detection," Journal of Computer Applications, 2020, pp. 1-7.
[3] D. Ridha and S. Suyanto, "Removing Unvoiced Segment to Improve Text Independent Speaker Recognition," 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 2019, pp. 50-53.
[4] A. K. Alimuradov, "Enhancement of Speech Signal Segmentation Using Teager Energy Operator," 2021 23rd International Conference on Digital Signal Processing and its Applications (DSPA), 2021, pp. 1-7.
[5] A. K. Alimuradov, "Speech/Pause Segmentation Method Based on Teager Energy Operator and Short-Time Energy Analysis," 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), 2021, pp. 0045-0048.
[6] R. Bachu, S. Kopparthi, B. Adapa, and B. Barkana, "Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal," American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008, pp. 1-7.
[7] K. Struwe, "Voiced-Unvoiced Classification of Speech Using a Neural Network Trained with LPC Coefficients," 2017 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), 2017, pp. 56-59.
[8] M. Musaev, I. Khujayorov, and M. Ochilov, "The Use of Neural Networks to Improve the Recognition Accuracy of Explosive and Unvoiced Phonemes in Uzbek Language," 2020 Information Communication Technologies Conference (ICTC), 2020, pp. 231-234.
[9] H. Cardot and D. Degras, "Online principal component analysis in high dimension: Which algorithm to choose?" International Statistical Review, 2018, pp. 29-50.
[10] S. Xiangbo and T. Wei, "Research on Multidimensional User Experience Evaluation Model Based on Principal Component Analysis," 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), 2020, pp. 554-557.
[11] S. Alakkari and J. Dingliana, "Modelling Large Scale Datasets Using Partitioning-Based PCA," 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 2646-2650.
[12] F. Jing, H. Shaohai, and M. Xiaole, "SAR image de-noising via grouping-based PCA and guided filter," Journal of Systems Engineering and Electronics, 2021, pp. 81-91.
[13] Z. Xia, Y. Chen, and C. Xu, "Multiview PCA: A Methodology of Feature Extraction and Dimension Reduction for High-Order Data," IEEE Transactions on Cybernetics.
[14] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[15] J. Ye, R. Janardan, and Q. Li, "GPCA: An efficient dimension reduction scheme for image compression and retrieval," in Proc. KDD, 2004, pp. 354-363.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US