A Classification Study in High-Dimensional Data of Linear
Discriminant Analysis and Regularized Discriminant Analysis
AUTCHA ARAVEEPORN, SOMSRI BANDITVILAI
Department of Statistics, School of Science,
King Mongkut's Institute of Technology Ladkrabang
10520, Bangkok,
THAILAND
Abstract: - The objective of this work is to compare linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) for classification in high-dimensional data. The dataset consists of a binary or
dichotomous response variable and continuous explanatory variables. The LDA and RDA methods are well-known
in statistical and probabilistic learning classification. The LDA creates the decision boundary as a linear
function under the assumption that the covariance matrices of the two classes are equal. The RDA extends the
LDA to resolve the estimation of the covariance matrix when the number of explanatory variables exceeds the
number of observations, the so-called high-dimensional setting. The explanatory data are generated from the
normal distribution, contaminated normal distribution, and uniform distribution. The binary response variables
are computed from the logit function depending on the explanatory variables. The highest average accuracy
percentage is used to assess the performance of the classification methods in several situations. Through the
simulation results, the LDA was successful with large sample sizes, but the RDA performed well for most
sample sizes.
Key-Words: - high-dimensional data, linear discriminant analysis, regularized discriminant analysis
Received: September 11, 2022. Revised: April 2, 2023. Accepted: April 26, 2023. Published: May 10, 2023.
1 Introduction
Discriminant analysis is a statistical technique that helps the researcher separate response
variables, in the form of categorical data, depending on the explanatory variables. This method comprises a
discriminant function, or decision function, in the form of a linear or quadratic function to divide two
or more classes of the response variable. [1], illustrated discriminant analysis on a challenging
classification problem and demonstrated that discriminant analysis had good predictive accuracy
under the normal distribution. [2], applied a cosine similarity measure based on the decision rule in
discriminant analysis.
Linear discriminant analysis is a well-known technique for dimensionality reduction and is used as a
pre-processing step in machine learning and pattern classification applications, [3]. The technique
rests on the assumption of a common covariance matrix under the multivariate normal distribution.
A decision boundary function is created for assigning observations to populations, and the likelihood
function is maximized to estimate the observations and the proportion of each population. [4], applied
linear discriminant analysis to small-sample-size classification problems in face recognition,
bioinformatics, and text recognition. [5], developed linear discriminant analysis into neighborhood
linear discriminant analysis, in which the scatter matrices are defined on a neighborhood consisting of
reverse nearest neighbors.
When each group is assumed to have its own covariance matrix, this leads to so-called quadratic
discriminant analysis. Linear discriminant analysis is straightforward when the number of observations
is greater than the number of explanatory variables. However, a severe problem arises when the number
of explanatory variables is greater than the number of observations, which defines high-dimensional
data. The sample covariance matrix is then singular and cannot be inverted for the computation of the
discriminant function. To overcome this problem, linear discriminant analysis is adapted into a new
method called regularized discriminant analysis, [6]. [7], improved the covariance in regularized
discriminant analysis on high-dimensional, low-sample-size data for the ill-posed inverse problem. [8],
conducted a large-dimensional study of regularized discriminant analysis classifiers in its two popular
forms, known as regularized LDA and regularized QDA.
The LDA has been extended to flexible discriminant analysis (FDA), [9], a valuable multigroup
classification tool. FDA obtains nonparametric versions of discriminant analysis by replacing linear
regression with any nonparametric regression method, and this technique can improve classification
performance. [10], considered high-dimensional data with a singular within-class covariance matrix,
called penalized LDA, and evaluated the performance of the resulting methods in a simulation study.
[11], described a penalized version of LDA designed for highly correlated independent variables. [12],
fitted a Gaussian mixture to each class to facilitate effective classification in non-normal settings.
This article aims to study the binary classification of high-dimensional data by comparing LDA and
RDA. Through simulated data, we generate explanatory variables from the normal distribution,
contaminated normal distribution, and uniform distribution, while response variables are obtained from
the logit function. The maximum average accuracy percentage is used to investigate the performance of
the two methods.
This study is divided into five sections: the first section discusses the importance and background of
linear discriminant analysis and regularized discriminant analysis. Section 2 presents the general
definitions related to discriminant analysis and the theorems of these methods. Section 3 presents the
simulation study and results used to construct the response and explanatory variables in the
high-dimensional data. A discussion of our simulation results is given in Section 4. Finally, the
conclusion and recommendations are provided in Section 5.
2 Discriminant Analysis
The explanation of LDA and RDA relates to the
Bayes theory concept based on a multivariate
normal distribution. Suppose that each of the two classes follows a normal distribution,
\[
x \sim \begin{cases} N(\mu_1, \sigma_1^2), & \text{if } x \in C_1, \\ N(\mu_2, \sigma_2^2), & \text{if } x \in C_2, \end{cases}
\]
where C_1 and C_2 denote the first and the second class. The probability density functions of the two
classes are f_1(x) and f_2(x), and the prior probabilities are denoted by π_1 and π_2. According to the
Bayes theorem, the posterior distribution is written by
\[
P(X = x \in C_k \mid x) = \frac{\pi_k f_k(x)}{\sum_{j=1}^{C} \pi_j f_j(x)}, \qquad (1)
\]
where C is the number of classes. The likelihood and the prior functions of class 1 are f_1(x) and π_1.
Therefore, the posterior comparison based on (1) becomes
\[
P(X = x \in C_1 \mid x) \ge P(X = x \in C_2 \mid x)
\;\Longleftrightarrow\;
\frac{\pi_1 f_1(x)}{\sum_{j=1}^{C} \pi_j f_j(x)} \ge \frac{\pi_2 f_2(x)}{\sum_{j=1}^{C} \pi_j f_j(x)},
\]
then
\[
\pi_1 f_1(x) \ge \pi_2 f_2(x). \qquad (2)
\]
Now, consider a multivariate dataset for discriminant analysis, x = (x_1, x_2, ..., x_n), with n
observations, where x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T, i = 1, 2, ..., n, in p variables. This dataset
is assumed to follow the multivariate normal distribution, x ~ N(μ, Σ). The probability density function
for x is
\[
f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \qquad (3)
\]
where μ = (μ_1, μ_2, ..., μ_p) denotes the mean vector of the dataset, Σ denotes the covariance matrix,
and Σ^{-1} denotes the inverse of the covariance matrix.
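For concreteness, the density in (3) can be evaluated directly from the formula. The following is a minimal R sketch using solve() and det(); the example values of mu and Sigma are illustrative and are not taken from the study.

```r
# Multivariate normal density f(x | mu, Sigma) of equation (3),
# computed directly from the formula (illustrative values only).
dmvn <- function(x, mu, Sigma) {
  p <- length(mu)
  d <- x - mu
  q <- as.numeric(t(d) %*% solve(Sigma) %*% d)   # (x - mu)^T Sigma^{-1} (x - mu)
  exp(-0.5 * q) / sqrt((2 * pi)^p * det(Sigma))
}

mu    <- c(0, 0)                                 # hypothetical p = 2 example
Sigma <- matrix(c(25, 5, 5, 25), nrow = 2)
dmvn(c(1, -1), mu, Sigma)
```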
Therefore, for the two classes, the multivariate normal distributions in (2) and (3) become
\[
\pi_1 \frac{1}{(2\pi)^{p/2} |\Sigma_1|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^{T} \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \right)
\ge
\pi_2 \frac{1}{(2\pi)^{p/2} |\Sigma_2|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)^{T} \Sigma_2^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) \right). \qquad (4)
\]
2.1 Linear Discriminant Analysis
Linear discriminant analysis assumes an equal covariance matrix for the two classes, Σ_1 = Σ_2 = Σ,
[13]. Therefore, the inequality in (4) becomes
\[
\pi_1 \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \right)
\ge
\pi_2 \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) \right), \qquad (5)
\]
where π_1 and π_2 are the prior probabilities of the two classes, and μ_1 and μ_2 are the means of the
two classes.
Taking the natural logarithm of both sides of (5) and simplifying gives
\[
-\frac{1}{2}\mathbf{x}^{T}\Sigma^{-1}\mathbf{x} + \mathbf{x}^{T}\Sigma^{-1}\boldsymbol{\mu}_1 - \frac{1}{2}\boldsymbol{\mu}_1^{T}\Sigma^{-1}\boldsymbol{\mu}_1 + \ln(\pi_1)
\ge
-\frac{1}{2}\mathbf{x}^{T}\Sigma^{-1}\mathbf{x} + \mathbf{x}^{T}\Sigma^{-1}\boldsymbol{\mu}_2 - \frac{1}{2}\boldsymbol{\mu}_2^{T}\Sigma^{-1}\boldsymbol{\mu}_2 + \ln(\pi_2). \qquad (6)
\]
The quadratic term x^T Σ^{-1} x appears on both sides of (6) and cancels; multiplying both sides by two
then gives
\[
2\mathbf{x}^{T}\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)^{T}\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + 2\ln\!\left(\frac{\pi_1}{\pi_2}\right) \ge 0. \qquad (7)
\]
Equation (7) has the form of a linear function A^T x + b = 0, which is why the method is called the LDA.
The decision boundary used to discriminate the two classes is therefore
\[
\delta(\mathbf{x}) = 2\mathbf{x}^{T}\hat{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2) - (\hat{\boldsymbol{\mu}}_1 + \hat{\boldsymbol{\mu}}_2)^{T}\hat{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2) + 2\ln\!\left(\frac{\hat{\pi}_1}{\hat{\pi}_2}\right). \qquad (8)
\]
The classification rule assigns the two classes as
\[
\mathbf{x} \in \begin{cases} C_1, & \text{if } \delta(\mathbf{x}) \ge 0, \\ C_2, & \text{if } \delta(\mathbf{x}) < 0. \end{cases} \qquad (9)
\]
The parameters in (8) and (9) are estimated from the multivariate dataset by the sample means and
covariance matrices:
\[
\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \mathbf{x}_i, \quad k = 1, 2, \qquad
\hat{\Sigma} = \frac{(n_1 - 1)\hat{\Sigma}_1 + (n_2 - 1)\hat{\Sigma}_2}{n - 2}, \quad n = n_1 + n_2,
\]
\[
\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i=1}^{n_k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^{T}, \qquad
\hat{\pi}_k = \frac{n_k}{n},
\]
where the estimate \hat{\Sigma} is called the pooled covariance matrix.
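These estimates translate directly into code. The following is a minimal R sketch of the two-class rule in (8)-(9) using the pooled covariance matrix; the function names (lda_fit, lda_delta) and the data layout are assumptions for illustration, not the authors' original program.

```r
# Two-class LDA following equations (8)-(9).
# X1, X2: n1 x p and n2 x p matrices of training observations for each class.
lda_fit <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2); n <- n1 + n2
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  Sp  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n - 2)   # pooled covariance
  list(mu1 = mu1, mu2 = mu2, Sigma = Sp, pi1 = n1 / n, pi2 = n2 / n)
}

# Decision function delta(x) of (8); rule (9) assigns class 1 when delta(x) >= 0.
# Note: solve() fails when p > n because the pooled covariance is singular,
# which is the motivation for the RDA of Section 2.2.
lda_delta <- function(fit, x) {
  Sinv <- solve(fit$Sigma)
  as.numeric(2 * t(x) %*% Sinv %*% (fit$mu1 - fit$mu2) -
             t(fit$mu1 + fit$mu2) %*% Sinv %*% (fit$mu1 - fit$mu2) +
             2 * log(fit$pi1 / fit$pi2))
}
```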
2.2 Regularized Discriminant Analysis
In high-dimensional data, the performance of linear discriminant analysis is far from optimal, since the
lack of observations makes the estimates unstable. Therefore, regularized discriminant analysis is
proposed to resolve the singularity problem. [14], proposed regularizing the covariance matrix by
defining
\[
\tilde{\Sigma} = \alpha \hat{\Sigma} + (1 - \alpha) I_p, \qquad (10)
\]
where α is the regularization parameter with values 0 ≤ α ≤ 1. The regularization can equally be applied
to the sample correlation matrix \hat{R} = \hat{D}^{-1/2} \hat{\Sigma} \hat{D}^{-1/2} in the same way,
\[
\tilde{R} = \alpha \hat{R} + (1 - \alpha) I_p, \qquad (11)
\]
where \hat{D} is the diagonal matrix of the pooled covariance matrix \hat{\Sigma}. Then, the regularized
covariance matrix is modified from (10) and (11) as
\[
\tilde{\Sigma} = \hat{D}^{1/2} \tilde{R} \hat{D}^{1/2}. \qquad (12)
\]
Now, the decision boundary depends on the regularized covariance matrix, and the corresponding linear
discriminant function can be defined as
\[
\delta_R(\mathbf{x}) = 2\mathbf{x}^{T}\tilde{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2) - (\hat{\boldsymbol{\mu}}_1 + \hat{\boldsymbol{\mu}}_2)^{T}\tilde{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2) + 2\ln\!\left(\frac{\hat{\pi}_1}{\hat{\pi}_2}\right), \qquad (13)
\]
where \tilde{\Sigma} is taken from (12), and the classification criterion is the same as (9).
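A minimal R sketch of the regularized covariance in (11)-(12) follows; the function name rda_cov and the choice of the regularization parameter alpha are illustrative assumptions. With this matrix in place of the pooled covariance, the rule (13) can reuse the same decision function as the LDA sketch above.

```r
# Regularized covariance matrix of equations (11)-(12).
# Sp: pooled covariance matrix; alpha: regularization parameter in [0, 1].
rda_cov <- function(Sp, alpha) {
  p  <- ncol(Sp)
  d  <- sqrt(diag(Sp))                     # square roots of the pooled variances
  R  <- Sp / outer(d, d)                   # sample correlation matrix R-hat
  Rt <- alpha * R + (1 - alpha) * diag(p)  # regularized correlation, eq. (11)
  outer(d, d) * Rt                         # back to the covariance scale, eq. (12)
}
# The direct form of eq. (10) would instead be alpha * Sp + (1 - alpha) * diag(p).
# Because the blended matrix is positive definite for alpha < 1, it can be
# inverted even when p > n, so rule (13) can reuse lda_delta() with
# fit$Sigma <- rda_cov(fit$Sigma, alpha = 0.5).
```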
3 Simulation Study and Results
The simulation study classifies the binary response variable (y) based on the explanatory variables (x)
by using linear discriminant analysis and regularized discriminant analysis. The explanatory variables
are generated from the normal distribution, contaminated normal distribution, and uniform distribution.
The normal distribution is a common data distribution with mean μ and variance σ², with density function
\[
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad -\infty < x < \infty,\; -\infty < \mu < \infty,\; \sigma^2 > 0.
\]
The simulated data are generated from a normal distribution with a mean of zero and a variance of
twenty-five, N(μ, σ²) = N(0, 25), and the probability density is shown in Fig. 1.
Fig. 1: The normal probability density with mean
zero and variance twenty-five.
The contaminated normal distribution is a mixture of two normal distributions with mixing probability of
contaminated data p and 1 − p, where 0 ≤ p ≤ 0.1. The contaminated normal density is
\[
f(x; \mu, \sigma^2) = (1 - p)\,N(\mu, \sigma^2) + p\,N(\mu, c^2\sigma^2),
\]
where c is a parameter that determines the wider standard deviation. In this case, we used ten percent
contaminated data (p = 0.1) and c = 5. The mean and variance are defined as for the normal distribution,
and the histogram of the contaminated normal distribution is shown in Fig. 2.
Fig. 2: The histogram of the contaminated normal distribution with mean zero, variance twenty-five,
p = 0.1, and c = 5.
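The contaminated normal sample can be drawn as a two-component mixture. A minimal R sketch under the settings stated above (mean zero, variance twenty-five, p = 0.1, c = 5); the function name rcontam is illustrative.

```r
# Contaminated normal: with probability 1 - p draw from N(mu, sigma^2),
# with probability p draw from the wider N(mu, (c * sigma)^2).
rcontam <- function(n, mu = 0, sigma = 5, p = 0.1, c = 5) {
  wide <- runif(n) < p
  ifelse(wide, rnorm(n, mu, c * sigma), rnorm(n, mu, sigma))
}

set.seed(1)
x <- rcontam(1000)
hist(x, breaks = 40, main = "Contaminated normal sample")
```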
Finally, the uniform distribution is a symmetric distribution with parameters a and b, which are the
minimum and maximum values. The uniform density is written as
\[
f(x) = \frac{1}{b - a}, \quad a \le x \le b,
\]
where the mean is E(X) = (a + b)/2 and the variance is Var(X) = (b − a)²/12. This explanatory variable is
simulated in the range of −2 to 2, with a mean of zero and a variance of 1.333. The probability density
is exhibited in Fig. 3.
Fig. 3: The uniform probability density in the range
of -2 to 2.
Through the simulation, the number of explanatory variables is greater than the number of observations
(n), which corresponds to high-dimensional data. The number of explanatory variables is set to
30 (n = 15, 20, 25), 60 (n = 20, 30, 40, 50, 55), and 100 (n = 20, 30, 40, 50, 70, 95). The response
variable is obtained from the logit function
\[
p(\mathbf{x}_i) = \frac{e^{\mathbf{x}_i^{T}\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^{T}\boldsymbol{\beta}}},
\]
where x are the explanatory variables and β are the coefficient parameters. If p(x_i) ≥ 0.5, the response
variable is denoted as y_i = 1, and y_i = 0 when p(x_i) < 0.5.
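A minimal R sketch of one simulated high-dimensional dataset under this design (here 30 explanatory variables and n = 20 observations in the normal case); the coefficient vector beta is an illustrative assumption, since the paper does not report its values.

```r
set.seed(2023)
p <- 30; n <- 20                                        # more variables than observations
X <- matrix(rnorm(n * p, mean = 0, sd = 5), nrow = n)   # normal case, N(0, 25)
beta <- rnorm(p, 0, 0.5)                                # illustrative coefficients
prob <- 1 / (1 + exp(-X %*% beta))                      # logit function p(x_i)
y <- as.numeric(prob >= 0.5)                            # y_i = 1 if p(x_i) >= 0.5, else 0
table(y)
```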
The R program was used to simulate the data and approximate the decision boundary to classify the
response variable. The confusion matrix was created to assess the performance of these classification
methods. The predicted data were compared with the real data using the accuracy percentage (Table 1).
Table 1. The confusion matrix of real data (y_i) and predicted data (ŷ_i).

                        Real data: y_i = 1        Real data: y_i = 0
Predicted: ŷ_i = 1      True Positive (TP)        False Positive (FP)
Predicted: ŷ_i = 0      False Negative (FN)       True Negative (TN)

\[
\text{Accuracy Percentage} = \frac{TP + TN}{TP + TN + FP + FN} \times 100.
\]
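Given the real and predicted labels, the confusion matrix and accuracy percentage of Table 1 can be computed as in the following sketch; the example label vectors are hypothetical.

```r
# Confusion matrix and accuracy percentage for 0/1 labels.
confusion    <- function(y, y_hat) table(Predicted = y_hat, Real = y)
accuracy_pct <- function(y, y_hat) mean(y_hat == y) * 100   # (TP + TN) / total * 100

y     <- c(1, 0, 1, 1, 0, 0)     # hypothetical real labels
y_hat <- c(1, 0, 0, 1, 0, 1)     # hypothetical predicted labels
confusion(y, y_hat)
accuracy_pct(y, y_hat)           # 66.67
```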
The average accuracy percentages for the classification by linear discriminant analysis and regularized
discriminant analysis are shown in Table 2, Table 3, and Table 4. Then Fig. 4, Fig. 5, and Fig. 6 show
the trend of the average accuracy percentage as the sample sizes increase.
Table 2. The average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 30 independent variables.

Sample Sizes     Normal            Contaminated Normal     Uniform
(n)              LDA      RDA      LDA      RDA            LDA      RDA
15               85.14    99.60    84.21    97.13          85.69    99.72
20               92.71    99.63    90.64    97.12          93.15    99.54
25               98.44    99.36    96.30    96.52          98.44    99.27
In Table 2, the RDA gives the highest average accuracy percentage in all cases. It can be seen that
increasing the sample size barely affects the classification accuracy of the RDA, in contrast to the LDA:
when the sample sizes increase, the average accuracy percentage of the LDA increases, as shown in Fig. 4.
Fig. 4: The trend of the average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 30 independent variables.
Table 3. The average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 60 independent variables.

Sample Sizes     Normal            Contaminated Normal     Uniform
(n)              LDA      RDA      LDA      RDA            LDA      RDA
20               77.71    99.86    79.62    98.64          77.04    99.82
30               85.61    99.67    87.19    98.28          85.64    99.65
40               94.00    99.31    93.05    98.15          94.19    99.52
50               99.27    99.23    97.94    97.78          99.28    99.16
55               99.94    99.06    99.50    97.66          99.95    99.14
From the average accuracy percentages in Table 3, the RDA is appropriate for the small sample sizes, but
the LDA outperforms it at the large sample sizes. The average accuracy percentage of the LDA increases as
the sample sizes increase, as shown in Fig. 5.
Fig. 5: The trend of the average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 60 independent variables.
Table 4. The average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 100 independent variables.

Sample Sizes     Normal            Contaminated Normal     Uniform
(n)              LDA      RDA      LDA      RDA            LDA      RDA
20               70.74    99.96    73.61    99.45          70.70    99.96
30               75.60    99.87    78.69    99.14          75.33    99.89
40               80.75    99.80    83.32    99.09          80.82    99.79
50               85.92    99.54    87.51    98.83          86.16    99.62
70               96.14    99.33    94.93    98.58          96.29    99.42
95               99.99    98.99    99.94    98.19          99.99    98.89
According to the results in Table 4, the RDA performs well in most cases, but the LDA gives nearly
perfect classification at the largest sample size. The average LDA accuracy percentage increases as the
sample sizes increase, as shown in Fig. 6.
Fig. 6: The trend of the average accuracy percentage of linear discriminant analysis (LDA) and regularized
discriminant analysis (RDA) under 100 independent variables.
4 Discussion
The classification performance for the binary response variable depends on the explanatory variables
generated from the normal, contaminated normal, and uniform distributions, as shown in Table 2, Table 3,
and Table 4. Starting with the first table, with the small number of explanatory variables, the average
accuracy percentage of the RDA is higher than that of the LDA for all sample sizes. Moreover, when the
number of explanatory variables is increased to the moderate and high settings, the average accuracy
percentage of the RDA is higher than that of the LDA for most sample sizes, as shown in Table 3 and
Table 4. Meanwhile, at the largest sample sizes, the average accuracy percentage of the LDA is greater
than that of the RDA. The average accuracy percentage increases as the sample sizes increase, as shown in
Fig. 4, Fig. 5, and Fig. 6. The different distributions give the same ranking of the methods, but the
normal and uniform distributions present the highest average accuracy percentages. The choice of data
distribution plays a vital role in good classification accuracy, [15].
5 Conclusion
This paper provided a binary classification study applying linear discriminant analysis (LDA) and
regularized discriminant analysis (RDA) to high-dimensional data. We examined explanatory variables from
several distributions for predicting binary response variables. Through a simulation study, the RDA
outperformed the LDA for most sample sizes. However, the LDA worked well at the largest sample sizes.
When considering the distributions, the average accuracy percentages of the normal and uniform
distributions were only slightly different, because both are symmetric distributions. In the case of data
with outliers, the RDA still performed well for classification. These results show that the RDA was
adequate for classification based on high-dimensional data in most cases. Therefore, we conclude that the
RDA can handle situations with a sizeable number of explanatory variables relative to the sample sizes.
Furthermore, the RDA has been recommended for small sample sizes, [16], and large dimensional data, [17].
For future work, the RDA can be applied to the classification of psychological tasks, [18].
Simulated data were mainly used in this research. For future work, real high-dimensional datasets should
be considered, especially medical data such as large-scale gene expression data for classifying disease
in small groups of patients. This research focused on discriminant classification; machine learning
methods could also be applied in this setting.
Acknowledgments:
This research is supported by King Mongkut's Institute of Technology Ladkrabang.
References:
[1] T. Ramayah, N.H. Ahmad, H. A. Halim, S. R. M. Zanai, M. H. Lo, Discriminant analysis: An illustrated
example, African Journal of Business Management, Vol.4, No.9, 2010, pp. 1654-1667.
[2] C. Liu, Discriminant analysis and similarity
measure, Pattern Recognition, Vol.47, No.1,
2014, pp.359 -367.
[3] A. Tharwat, T. Gaber, A. Ibrahim, A. E. Hassanien, Linear discriminant analysis: A detailed tutorial,
AI Communications, Vol.30, No.2, 2017, pp.169-190.
[4] A. Sharma, K. K. Paliwal, Linear discriminant
analysis for the small sample size
problem: an overview, International Journal
of Machine Learning and Cybernetics, Vol.6,
2017, pp.443-454.
[5] F. Zhu, J. Gao, J. Yang, N. Ye, Neighborhood
linear discriminant analysis, Pattern
Recognition, Vol.123, No.2, 2022, Article no.
108422.
[6] J. H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association,
Vol.84, 1989, pp.165-175.
[7] S. Yang, H. Xiong, K. Xu, L. Wang, J. Bian, Z. Sun, Improving covariance-regularized discriminant
analysis for EHR-based predictive analytics of diseases, Applied Intelligence, Vol.51, 2021, pp.377-395.
[8] K. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, M. S. Alouini, A large dimensional study of
regularized discriminant analysis, IEEE Transactions on Signal Processing, Vol.68, 2020, pp.2464-2479.
[9] T. Hastie, R. Tibshirani, A. Buja, Flexible Discriminant Analysis by Optimal Scoring, Journal of the
American Statistical Association, Vol.89, No.428, 1994, pp.1255-1270.
[10] D.M. Witten, R. Tibshirani, Penalized Classification using Fisher's Linear Discriminant, Journal of
the Royal Statistical Society, Series B, Vol.73, No.5, 2011, pp.753-772.
[11] T. Hastie, A. Buja, R. Tibshirani, Penalized Discriminant Analysis, The Annals of Statistics,
Vol.23, No.1, 1995, pp.73-102.
[12] T. Hastie, R. Tibshirani, Discriminant Analysis by Gaussian Mixtures, Journal of the Royal
Statistical Society, Series B, Vol.58, No.1, 1996, pp.155-176.
[13] B. Ghojogh, M. Crowley, Linear and quadratic discriminant analysis: Tutorial, arXiv preprint
arXiv:1906.02590, 2019.
[14] Y.Guo, T. Hastie, R. Tibshirani, Regularized
Discriminant Analysis and Its Application in
Microarrays, Biostatistics, Vol. 1, No.1,2005,
pp. 1-8.
[15] K. Ksushbu, P. Nishad, V. Kasyap, I. Gupta.
A Classification and Distribution Model for
Data Leakage Prevention and Detection,
International Research Journal of
Modernization in Engineering Technology
and Science, Vol. 3, No.2, 2021, pp. 348-354.
[16] J. Ye, T. Wang, Regularized discriminant analysis for high dimensional, low sample size data,
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
Philadelphia, Pennsylvania, USA, 2006, pp.454-463.
[17] X. Yang, K. Elkhalil, A. Kammoun, T. Y. Al-Naffouri, M. S. Alouini, Regularized Discriminant
Analysis: A Large Dimensional Study, 2018 IEEE International Symposium on Information Theory (ISIT),
Vail, Colorado, USA, 2018, pp. 536-540.
[18] R. Fu, M. Han, Y. Tian, P. Shi, Improvement motor imagery EEG classification based on sparse common
spatial pattern and regularized discriminant analysis, Journal of Neuroscience Methods, Vol.343, 2020,
Article no. 108833.
Contribution of Individual Authors to the
Creation of a Scientific Article:
-Autcha Araveeporn has conceptualized the research and organized the simulation process through to the
discussion.
-Somsri Banditvilai has derived the results and
made the conclusion.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself:
This research is supported by King Mongkut's
Institute of Technology Ladkrabang.
Conflict of Interest
The authors have no conflict of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US