Unvoiced and Voiced Speech Segmentation Based on the Dimension of Signal Local Linear Manifold

ZHAOTING LIU, Electronic Information College, Qingdao University, Qingdao, CHINA
ZHONGXIAO LI, Electronic Information College, Qingdao University, Qingdao, 266071, CHINA
XIAODONG ZHUANG, Electronic Information College, Qingdao University, Qingdao, 266071, CHINA
NIKOS MASTORAKIS, Technical University of Sofia, Sofia, 1000, BULGARIA

Abstract—A new method of unvoiced and voiced speech segmentation is proposed from the perspective of local linear manifold analysis of the speech signal. It is based on the estimation of the dimension of the short-time linear subspace. The subspace dimensional characteristics of single-phoneme signals are studied. The local signal vector set is analyzed with the PCA algorithm to estimate the dimension of the data matrix formed by framing. Local PCA is applied to the speech signal to achieve the segmentation of unvoiced and voiced pronunciation. Simulation experiments prove the effectiveness of the proposed method.

Keywords—subspace dimension, local PCA, unvoiced and voiced speech segmentation

Received: April 19, 2021. Revised: February 20, 2022. Accepted: March 24, 2022. Published: April 28, 2022.

1. Introduction

Human pronunciation can be divided into voiced and unvoiced sound according to the vibration of the vocal cords. The vocal cords do not vibrate when air flows through the open vocal cords without obstruction; this produces unvoiced sound. Voiced sound is produced when the vocal cords close and the airflow makes them vibrate. If speech processing relies only on the overall synthesized features of the signal, it inevitably blurs the characteristics of the two components of speech (i.e., the unvoiced and the voiced).
Applying this classification to speech signal processing solves the problem that vowels and consonants have different time-frequency resolution requirements [1-3]. At the same time, the adaptability and performance of speech recognition can be enhanced. At present, there are several classical methods for classifying unvoiced and voiced sounds [4]. Eigenvalue thresholds can be set based on the difference in the short-term energy of the two sounds, and on this basis the judgment can be improved using short-term energy distribution characteristics [5], but this requires a large amount of computation and a complex implementation. Alternatively, the short-term zero-crossing rate can be used for the judgment [6]. Either way, the accuracy of these methods is unsatisfactory. Since the 1980s, artificial neural networks have also been introduced into this field, but their training is slow and easily falls into local optima [7-8].
In this paper, an unvoiced and voiced sound segmentation algorithm based on the estimation of the short-term linear subspace dimension of speech is designed. First, an overall principal component analysis of different monophones is carried out [9-10]. We find that the number of principal components (i.e., the dimension) of unvoiced and voiced sounds follows different trends as the frame length grows. On this basis, a local principal component analysis of continuous speech is then performed [11-12]. The change of the signal dimension over time reveals which time periods are voiced and which are unvoiced. The method exploits the difference in the number of principal components between unvoiced and voiced signals to obtain a decision criterion. It has good real-time performance and high accuracy.
2. PCA Algorithm

PCA transforms the original data into a representation whose dimensions are linearly independent, thereby achieving dimensionality reduction [13-15]. Simultaneously, it can be used to extract the
main feature components of the data. PCA transform, also
known as Hotelling transform or K-L transform, is an
orthogonal linear transform. The transform can be understood as a linear projection of the data into the subspace of smallest dimension, such that the resulting components are ordered by the amount of information they carry. The first principal component contains the largest amount of information, and each subsequent component contains less. Moreover, the principal components are uncorrelated after the transformation. After PCA, the information of the signal is concentrated mainly in the first few principal components. Generally, the components with a small amount of information are discarded, and only the leading components whose cumulative amount of information exceeds 90%~95% are retained. Each eigenvector of the data covariance matrix is a coordinate vector of the subspace, and its corresponding eigenvalue is the variance of the initial signal projected onto that axis.
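To illustrate that last property, here is a quick NumPy check of our own (not from the paper): the eigenvalues of the sample covariance matrix coincide with the variances of the data projected onto the corresponding eigenvectors.

```python
import numpy as np

# Illustrative check (not part of the paper's method): eigenvalues of the
# covariance matrix equal the variances of the data projected onto the
# corresponding eigenvectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                           [1.0, 1.0, 0.0],
                                           [0.0, 0.2, 0.1]])
Xc = X - X.mean(axis=0)                 # decentralize (Step 1 below)
C = (Xc.T @ Xc) / len(Xc)               # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
proj_var = np.var(Xc @ eigvecs, axis=0) # variance along each eigenvector
assert np.allclose(eigvals, proj_var)   # the two coincide
```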
The algorithm flow of PCA analysis of a speech signal is as follows:
First, divide the speech signal into $M$ frames, each with $N$ dimensions, and denote the $n$-th dimension element of the $m$-th frame as $X_m^n$.
Step 1: Decentralize all features, that is, find the mean of each dimension and then subtract that mean from each feature.
The $n$-th dimension mean:
$$\bar{X}^{n} = \frac{1}{M}\sum_{i=1}^{M} X_{i}^{n} \qquad (1)$$
The original signal becomes:
$$\tilde{X}_{m}^{n} = X_{m}^{n} - \bar{X}^{n} \qquad (2)$$
Step 2: Find the covariance matrix.
The covariance of two dimensions:
$$\operatorname{cov}(X^{n_1}, X^{n_2}) = \frac{1}{M}\sum_{i=1}^{M} \tilde{X}_{i}^{n_1}\,\tilde{X}_{i}^{n_2} \qquad (3)$$
The covariance matrix $C$:
$$C = \begin{pmatrix} \operatorname{cov}(X^{1},X^{1}) & \cdots & \operatorname{cov}(X^{1},X^{N}) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(X^{N},X^{1}) & \cdots & \operatorname{cov}(X^{N},X^{N}) \end{pmatrix} \qquad (4)$$
Step 3: Find the eigenvalues $\lambda$ of the covariance matrix $C$ and the corresponding eigenvectors $u$:
$$C u = \lambda u \qquad (5)$$
There are a total of $N$ eigenvalues; they are arranged from large to small, and the first $k$ are selected to get $u_1, \ldots, u_k$.
Step 4: Project the original features onto the selected eigenvectors, and use $y_m^k$ to denote the $k$-th dimension of the $m$-th frame:
$$\begin{pmatrix} y_{m}^{1} \\ y_{m}^{2} \\ \vdots \\ y_{m}^{k} \end{pmatrix} = \begin{pmatrix} u_{1}^{T} \\ u_{2}^{T} \\ \vdots \\ u_{k}^{T} \end{pmatrix} \left( \tilde{X}_{m}^{1}, \tilde{X}_{m}^{2}, \ldots, \tilde{X}_{m}^{N} \right)^{T} \qquad (6)$$
Step 5: Find the proportion of information in each dimension:
$$e_{n} = \frac{\lambda_{n}}{\sum_{j=1}^{N} \lambda_{j}} \qquad (7)$$
The value of an eigenvalue divided by the sum of all eigenvalues is the variance contribution rate of the eigenvector. It represents the proportion of the amount of information contained in this dimension.
Fig. 1 Flow chart of PCA algorithm
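To make the five steps concrete, the following is a minimal NumPy sketch of our own (not code from the paper; the function name count_principal_components and its arguments are our assumptions). It frames a signal, decentralizes it, forms the covariance matrix, and counts how many leading eigenvalues are needed to exceed a given information proportion. Step 4 (the projection itself) is omitted, since only the component count is needed for the segmentation that follows.

```python
import numpy as np

def count_principal_components(signal, frame_len, frame_offset, ratio=0.90):
    """Frame the signal, run PCA (Steps 1-5), and return the number of
    leading principal components whose cumulative variance contribution
    rate (Eq. 7) first reaches `ratio`."""
    # Frame the signal: M frames of N = frame_len dimensions each.
    starts = range(0, len(signal) - frame_len + 1, frame_offset)
    X = np.array([signal[s:s + frame_len] for s in starts])  # shape (M, N)
    # Step 1: decentralize each dimension (Eqs. 1-2).
    X = X - X.mean(axis=0)
    # Step 2: covariance matrix (Eqs. 3-4).
    C = (X.T @ X) / X.shape[0]
    # Step 3: eigenvalues, sorted from large to small (Eq. 5).
    eigvals = np.linalg.eigvalsh(C)[::-1]
    # Step 5: variance contribution rates (Eq. 7) and their cumulative sum.
    e = eigvals / eigvals.sum()
    return int(np.searchsorted(np.cumsum(e), ratio) + 1)
```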
3. Continuous Speech Segmentation Method Based on Local PCA

Local PCA means taking some frames that are close in time near a certain moment and performing PCA on them, instead of performing PCA on all frames of the whole signal. We apply local PCA analysis to continuous speech signals containing different pronunciation phonemes. The purpose is to check how the number of principal components of the local signal vector set changes over time; the unvoiced and voiced sounds can then be determined.
First, the frame offset is used as a variable. To increase the number of signal vectors, the frame offset between adjacent frames can be made smaller. Since PCA is a statistical method, it requires enough sample vectors: if the frame offset is too large, there will be too few signal vectors in the local temporal neighborhood, and the PCA results will lose statistical significance. However, a frame offset that is too small may also bring problems such as an increased amount of computation.
At the ordinary speech rate of continuous speech signals, a duration of about 16 ms to 32 ms corresponds to one pronunciation phoneme. Therefore, the local time interval is set to 20 to 30 ms. At a sampling frequency of 16 kHz, this converts to a local interval length of 320 to 480 sampling points. First, the frame length, frame offset, and local interval are set, and a local interval is taken from the beginning of the signal. Next, the local signal is framed to form a data matrix, and a complete PCA analysis is performed to obtain the number of principal components.
Repeat the above steps starting from the second frame, and the number of principal components as a function of time is obtained. The unvoiced and voiced sounds can then be effectively segmented by applying a suitable threshold to the result.
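A minimal sketch of this local procedure follows (again our own illustration, reusing count_principal_components from the sketch in Section 2; local_pca_curve and segment_unvoiced are hypothetical names, and the default parameters mirror the settings reported in Section 4.2).

```python
import numpy as np

def local_pca_curve(signal, frame_len=128, frame_offset=4, local_len=400):
    """Slide a local interval along the signal; at each position, frame the
    local samples and count the principal components (>90% of information)."""
    counts = []
    for start in range(0, len(signal) - local_len + 1, frame_offset):
        local = signal[start:start + local_len]
        counts.append(count_principal_components(local, frame_len, frame_offset))
    return np.array(counts)

def segment_unvoiced(curve, threshold=13):
    """Boolean mask: True where the local dimension exceeds the threshold,
    i.e., where the segment is judged unvoiced."""
    return curve > threshold
```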
4. Experimental Simulation Analysis

4.1 PCA Analysis of Monophone Signals

An overall PCA analysis of the monophone signals was carried out to observe how the number of principal components of each signal changes with the frame length, and thus to see the difference between different signals.
Fig. 2 PCA analysis of monophone signals
The results are shown in the figures below:
Fig. 3 The number of principal components varies with frame length (more than 90% composition)
Fig. 4 The number of principal components varies with frame length (more than 95% composition)
In Fig. 3 and Fig. 4, the abscissa represents the frame length,
and the ordinate represents the number of principal components.
The number of principal components of some phonemes varies with the frame length as shown in the tables below:
TABLE 1. THE NUMBER OF PRINCIPAL COMPONENTS VARIES WITH FRAME LENGTH (MORE THAN 90% COMPOSITION)

| Signal \ Frame length | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| /u/ | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| /o/ | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 9 | 9 |
| /sh/ | 6 | 11 | 15 | 19 | 23 | 28 | 31 | 35 | 39 | 43 | 46 | 50 | 52 | 55 | 58 | 61 |
| /s/ | 4 | 6 | 9 | 11 | 13 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 31 | 32 | 34 | 36 |
[Figs. 3 and 4 plot these curves for the phonemes /u/, /o/, /i/, /e/, /a/, /sh/, /s/, /h/, devoiced /e/, and devoiced /a/; abscissa: frame length (16 to 256 samples), ordinate: number of principal components.]
TABLE 2. THE NUMBER OF PRINCIPAL COMPONENTS VARIES WITH FRAME LENGTH (MORE THAN 95% COMPOSITION)

| Signal \ Frame length | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| /u/ | 2 | 4 | 5 | 6 | 7 | 8 | 8 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 10 |
| /o/ | 3 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | 11 | 12 | 12 | 12 | 12 | 13 | 13 | 13 |
| /sh/ | 8 | 14 | 19 | 24 | 29 | 35 | 40 | 44 | 49 | 54 | 58 | 63 | 66 | 70 | 74 | 77 |
| /s/ | 5 | 9 | 12 | 15 | 18 | 21 | 24 | 27 | 30 | 33 | 36 | 38 | 41 | 43 | 45 | 47 |
It can be seen from the figures and tables that for voiced sounds the number of principal components tends to a limit value, while for unvoiced sounds it increases approximately linearly with the frame length. Under the same frame length, the number of principal components differs between the pronunciation signals of different phonemes.
4.2 Local PCA Analysis of Continuous Speech Signals

Based on the above results, a method for segmenting the different phonemes of continuous speech signals based on local PCA analysis is proposed. That is, some temporally close frames near a certain moment are taken for PCA, and the number of principal components of the local signal vector set is checked as time passes. Thus, the voiced and unvoiced phonemes in word pronunciation can be judged and segmented.
Fig. 5 Flowchart of local PCA over time
The following figures show the local PCA analysis of the signals of the three words 'face', 'show', and 'wash'. The frame length is 128, the frame offset is 4, and the local range is 400 points. A certain threshold (boundary value) is applied to the result curve, and the outcome is compared with the time-domain waveforms.
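Under the assumptions of the earlier sketches, these experimental settings would correspond to a call like the following (the white-noise stand-in is used purely so the snippet runs; the paper's signals are actual word recordings sampled at 16 kHz).

```python
import numpy as np

# Stand-in for a real 16 kHz recording of 'show' (white noise here,
# only so the snippet is runnable).
signal = np.random.default_rng(1).normal(size=8000)

# Settings reported in Sec. 4.2: frame length 128, frame offset 4,
# local range 400 points; threshold 13 for the >90% criterion.
curve = local_pca_curve(signal, frame_len=128, frame_offset=4, local_len=400)
unvoiced_mask = segment_unvoiced(curve, threshold=13)  # True -> unvoiced
```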
Fig. 6 Local PCA segmentation results of 'show' word signal (more than 90% components)
Fig. 7 Local PCA segmentation results of 'show' word signal (more than 95% components)
Fig. 8 Local PCA segmentation results of 'face' word signal (more than 90% components)
Fig. 9 Local PCA segmentation results of 'face' word signal (more than 95% components)
Fig. 10 Local PCA segmentation results of 'wash' word signal (more than 90% components)
Fig. 11 Local PCA segmentation results of 'wash' word signal (more than 95% components)
The upper part of each figure is the time-domain waveform of the signal, and the lower part is the resulting graph of the number of principal components over time. The vertical direction of the two plots corresponds to the same time.

The thresholds on the number of principal components are taken as 13 (for more than 90% of components) and 18 (for more than 95% of components). Regarding the distinction between unvoiced and voiced sounds, it can be seen from the result graphs above that the positions of the red vertical lines correspond accurately to the unvoiced/voiced boundaries in the time-domain waveform; that is, the method can segment unvoiced and voiced sounds. However, the 'face' signal contains a silent segment. Although the voiced and unvoiced sounds can still be segmented, the presence of the silent segment cannot be distinguished. Therefore, the silent segment should be extracted first, and the remainder then segmented by the method in this paper.
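As a sketch of that preprocessing step (not part of the paper's method; a plain short-time energy gate is only one possible choice, and remove_silence is a hypothetical helper), silent frames could be dropped before running the local PCA segmentation.

```python
import numpy as np

def remove_silence(signal, frame_len=128, energy_ratio=0.01):
    """Drop frames whose short-time energy falls below a fraction of the
    maximum frame energy; a simple (assumed) silence gate."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    frames = [signal[s:s + frame_len] for s in starts]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    keep = energies > energy_ratio * energies.max()
    return np.concatenate([f for f, k in zip(frames, keep) if k])
```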
5. Conclusion

This paper first studies the relationship between the number of principal components and the frame length after a monophone signal is divided into frames and reduced in dimension. As the frame length increases, the number of principal components tends to a limit for voiced sounds, while for unvoiced sounds it increases approximately linearly. Under the same frame length, the number of principal components differs between the pronunciation signals of different phonemes. Further research on continuous speech segmentation by local PCA is then carried out: a set of speech frames that are very close in time is used for PCA analysis, and the resulting graph of the number of local principal components over time is compared with the time-domain waveform. It is found that voiced and unvoiced sounds can be effectively segmented by setting a threshold. Future research will address the joint segmentation of silent segments and unvoiced or voiced sounds, striving to achieve high-accuracy real-time segmentation that differs from traditional methods.
References

[1] Q. Huang, C. Bao, X. Wang, and Y. Xiang, "Speech enhancement method based on multi-band excitation model," Applied Acoustics, vol. 163, 2020.
[2] J. Yang, Z. Li, and P. Su, "Review of speech segmentation and endpoint detection," Journal of Computer Applications, 2020, pp. 1-7.
[3] D. Ridha and S. Suyanto, "Removing Unvoiced Segment to Improve Text Independent Speaker Recognition," 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 2019, pp. 50-53.
[4] A. K. Alimuradov, "Enhancement of Speech Signal Segmentation Using Teager Energy Operator," 2021 23rd International Conference on Digital Signal Processing and its Applications (DSPA), 2021, pp. 1-7.
[5] A. K. Alimuradov, "Speech/Pause Segmentation Method Based on Teager Energy Operator and Short-Time Energy Analysis," 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), 2021, pp. 0045-0048.
[6] R. Bachu, S. Kopparthi, B. Adapa, and B. Barkana, "Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal," American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008, pp. 1-7.
[7] K. Struwe, "Voiced-Unvoiced Classification of Speech Using a Neural Network Trained with LPC Coefficients," 2017 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), 2017, pp. 56-59.
[8] M. Musaev, I. Khujayorov, and M. Ochilov, "The Use of Neural Networks to Improve the Recognition Accuracy of Explosive and Unvoiced Phonemes in Uzbek Language," 2020 Information Communication Technologies Conference (ICTC), 2020, pp. 231-234.
[9] H. Cardot and D. Degras, "Online principal component analysis in high dimension: Which algorithm to choose?" International Statistical Review, 2018, pp. 29-50.
[10] S. Xiangbo and T. Wei, "Research on Multidimensional User Experience Evaluation Model Based on Principal Component Analysis," 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), 2020, pp. 554-557.
[11] S. Alakkari and J. Dingliana, "Modelling Large Scale Datasets Using Partitioning-Based PCA," 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 2646-2650.
[12] F. Jing, H. Shaohai, and M. Xiaole, "SAR image de-noising via grouping-based PCA and guided filter," Journal of Systems Engineering and Electronics, 2021, pp. 81-91.
[13] Z. Xia, Y. Chen, and C. Xu, "Multiview PCA: A Methodology of Feature Extraction and Dimension Reduction for High-Order Data," IEEE Transactions on Cybernetics.
[14] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[15] J. Ye, R. Janardan, and Q. Li, "GPCA: An efficient dimension reduction scheme for image compression and retrieval," in Proc. KDD, 2004, pp. 354-363.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US