Comparison of Logistic Regression and Discriminant Analysis for
Classification of Multicollinearity Data
AUTCHA ARAVEEPORN
Department of Statistics, School of Science,
King Mongkut's Institute of Technology Ladkrabang,
Bangkok, 10520,
THAILAND
Abstract: - The objective of this study is to compare the classification performance of logistic regression and discriminant analysis on a simulated dataset and on actual data on liver patients. Both datasets have a binary dependent variable that depends on correlated independent variables, so-called multicollinearity data. The standard classification method is logistic regression, which uses the probability from the logit function to model the dichotomous dependent variable. An iterative process estimates the logit function's parameters and explains the relationship between the binary dependent variable and the independent variables. Discriminant analysis is a powerful classification approach comprising linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and regularized discriminant analysis (RDA). These methods obtain decision boundaries by building a classifier model on the multivariate normal distribution. LDA assumes a common covariance matrix, whereas QDA allows an individual covariance matrix for each class. RDA extends QDA by introducing regularization parameters into the estimate of the covariance matrix. In the simulation study, the independent variables are generated from a multivariate normal distribution with a constant correlation, which creates the multicollinearity problem; the binary response variable is then obtained from the logit function. For the application to actual data, we classified liver and non-liver patients as the dependent variable using nine independent variables from the patients' personal and clinical records. The highest average percentage of accuracy determines the performance of these methods. The results show that logistic regression was successful with a small number of independent variables, but RDA performed better with a large number of independent variables.
Key-Words: - linear discriminant analysis, quadratic discriminant analysis, regularized discriminant analysis
Received: October 19, 2022. Revised: December 17, 2022. Accepted: January 15, 2023. Published: February 16, 2023.
1 Introduction
Regression analysis is concerned with describing the relationship between a dependent variable and one or more independent variables. The dependent variable is usually continuous, but sometimes the outcome variable is discrete; logistic regression was developed to handle this case.
Logistic regression is an effective statistical technique widely used to find the best-fitting model for biological and medical data. The logistic regression model differs from the linear regression model in that the outcome variable is binary or dichotomous; consequently, the two models also differ in their parameterization and assumptions. Lever et al., [1], showed that logistic regression is a powerful tool for predicting class membership by probability and for classifying a binary dependent variable.
One assumption of logistic regression is that there are no multicollinearity problems between the independent variables. It is often the case, however, that the independent variables are correlated, which can lead to misleading conclusions. The importance of multicollinearity data was studied in [2] for classifying gully erosion by comparing discriminant analyses.
The basic idea of discriminant analysis is to separate two or more groups of observed data by creating decision boundaries, in linear or quadratic form, for classification. Discriminant analysis involves a categorical dependent variable and independent variables that follow a multivariate normal distribution with equal or unequal covariance matrices. Classification by discriminant analysis was developed further for small sample sizes, [3], and with new theory, [4]. An illustration of discriminant analysis on medical data of patients who had suffered a heart attack, [5], classified whether a patient would survive or not survive
based on the independent variables. Discriminant analysis is commonly divided into linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
LDA is a classification and dimensionality-reduction technique that can be explained from two perspectives: the first is a probabilistic interpretation, and the second is based on constructing a discriminant projection. Tharwat et al., [6], studied the workings of linear discriminant analysis in different applications and concluded that LDA is robust in classification accuracy. Zhu et al., [7], noted that LDA assumes all data are independently and identically distributed and proposed the neighborhood linear discriminant for when this assumption does not hold. Dudoit et al., [8], compared the performance of discriminant methods for tumor classification using gene expression data. LDA assumes an equal covariance matrix for all classes, and the decision boundary is calculated as a linear function.
In particular, when each class has an individual covariance matrix, this leads to the so-called QDA, and the decision boundary becomes a quadratic function. Under the multivariate normal distribution, with a mean and covariance matrix assumed for each class, the parameters of the decision boundary are produced by the maximum likelihood method, [9]. Tharwat, [10], collected the essential background of LDA and QDA in different classification applications; the discriminant functions and decision boundaries were highlighted with numerical illustrations.
LDA and QDA were improved by regularizing the individual covariance matrices. Friedman, [11], proposed a regularization parameter to control the shrinkage of the covariance matrix, called regularized discriminant analysis (RDA). Pima and Aladjem, [12], studied RDA in face recognition and checked the sensitivity of RDA to different methods of photometric preprocessing. Elkhalil et al., [13], conducted a large-dimensional study of RDA classifiers with its two popular variants, regularized LDA and regularized QDA.
This research aims to investigate classification by logistic regression, LDA, QDA, and RDA. For this purpose, we generate independent variables with a multicollinearity problem from a multivariate normal distribution and obtain binary dependent variables conditional on them. The methods are then applied to actual data to classify liver and non-liver patients from northeast Andhra Pradesh, India, with nine independent variables. The percentage of accuracy determines the performance of the four methods.
2 Classification Methods
Logistic regression and discriminant analysis are the
main methods to classify multicollinearity data.
2.1 Logistic Regression Method
The logistic regression model is used when the dependent variable ($Y$) is binary or dichotomous and depends on the independent variables ($X$). This methodology is used to study medical conditions; for example, it is helpful in predicting the presence or absence of evidence of coronary heart disease, [14]. The independent variables can be continuous or categorical.
To understand the construction of the logistic regression model, start with the conditional probability of the dependent variable given the independent variable, denoted $P(Y \mid X)$. The class of $Y$ is denoted "1" for success and "0" for failure, so $Y$ is a binary variable following the Bernoulli distribution. One can verify that $P(Y=1) = E(Y)$, and the conditional probability is $P(Y=1 \mid X=x) = E(Y \mid X=x)$. Assume that $P(Y=1 \mid X=x) = p(x)$ and $P(Y=0 \mid X=x) = 1 - p(x)$. The likelihood function is written as

$$P(Y_i = y_i \mid X_i = x_i) = \prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1-y_i}. \qquad (1)$$
The basic idea is to let $p(x)$ be a linear function, but a linear function is unbounded while $0 \le p(x) \le 1$. The next idea is to let $\log p(x)$ be a linear function, but the logarithm is still bounded on one side. Finally, the logit transformation $\log \frac{p(x)}{1-p(x)}$ is a linear function without boundedness problems. This last choice is called logistic regression. Formally, the logistic regression model is

$$\log \frac{p(x_i)}{1-p(x_i)} = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}, \qquad (2)$$

where $\beta$ is the vector of coefficients of the logit transformation, $x_i$ is the set of independent variables, $k$ is the number of independent variables, and $n$ is the number of observations.
From (1), the likelihood function is then

$$l(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1-y_i}. \qquad (3)$$
Taking logarithms, the log-likelihood from (3) turns into a summation:

$$\begin{aligned}
\ln l(\beta) &= \sum_{i=1}^{n}\left[ y_i \log p(x_i) + (1-y_i)\log(1-p(x_i)) \right] \\
&= \sum_{i=1}^{n}\left[ y_i \log\frac{p(x_i)}{1-p(x_i)} + \log(1-p(x_i)) \right] \\
&= \sum_{i=1}^{n}\left[ y_i\,x_i^{T}\beta - \log\!\left(1+e^{x_i^{T}\beta}\right) \right].
\end{aligned}$$
The maximum likelihood estimate is obtained by differentiating the log-likelihood with respect to the parameter:

$$\frac{\partial \ln l(\beta)}{\partial \beta} = \sum_{i=1}^{n}\left( y_i - \frac{e^{x_i^{T}\beta}}{1+e^{x_i^{T}\beta}} \right) x_i = \sum_{i=1}^{n}\left( y_i - p(x_i;\beta) \right) x_i. \qquad (4)$$
Setting (4) to zero gives no closed-form solution, so the maximum likelihood estimator cannot be obtained directly from this formula. The value of $\beta$ that maximizes $\log l(\beta)$ is instead found by iterative techniques, [15]. The Newton-Raphson method, studied by Akram and Ann, [16], performs well for fitting the logistic regression model. In its simplest case, the Newton-Raphson method minimizes a function $f(\beta)$ of one scalar variable to find the global minimum $\beta^{*}$. Assume that $f(\beta)$ is a smooth function and $\beta^{*}$ is a regular interior minimum. Near the minimum, $f$ can be approximated by a Taylor expansion:

$$f(\beta) \approx f(\beta^{*}) + \frac{1}{2}(\beta-\beta^{*})^{2}\left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{*}}, \qquad (5)$$

where the first-order term vanishes because the derivative is zero at an interior minimum, so $f(\beta)$ is close to quadratic near the minimum. Newton's method minimizes this quadratic approximation. Guess an initial point $\beta^{(0)}$ and take a second-order Taylor expansion around $\beta^{(0)}$:

$$f(\beta) \approx f(\beta^{(0)}) + (\beta-\beta^{(0)})\left.\frac{df}{d\beta}\right|_{\beta=\beta^{(0)}} + \frac{1}{2}(\beta-\beta^{(0)})^{2}\left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{(0)}}. \qquad (6)$$

Abbreviate the derivatives as $f'(\beta^{(0)}) = \left.\frac{df}{d\beta}\right|_{\beta=\beta^{(0)}}$ and $f''(\beta^{(0)}) = \left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{(0)}}$. Taking the derivative of (6) with respect to $\beta$, setting it equal to zero, and calling the solution $\beta^{(1)}$ gives

$$0 = f'(\beta^{(0)}) + \frac{1}{2} f''(\beta^{(0)})\,2(\beta^{(1)}-\beta^{(0)}), \qquad \beta^{(1)} = \beta^{(0)} - \frac{f'(\beta^{(0)})}{f''(\beta^{(0)})}.$$

The approximation is refined by iterating this one-step minimization:

$$\beta^{(n+1)} = \beta^{(n)} - \frac{f'(\beta^{(n)})}{f''(\beta^{(n)})}.$$
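As a concrete illustration, the sketch below implements this Newton-Raphson update (equivalently, iteratively reweighted least squares) for the multivariate coefficient vector in R, the language used for the experiments in Section 3. The simulated data and variable names are hypothetical, and in practice the built-in glm() with family = binomial performs the same fit.

# Newton-Raphson fit of a logistic regression model: a minimal sketch.
# X is the n x (k+1) design matrix (first column of 1s), y the 0/1 response.
newton_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                 # starting point beta^(0)
  for (iter in 1:max_iter) {
    p <- 1 / (1 + exp(-X %*% beta))       # p(x_i; beta), the logit probabilities
    grad <- t(X) %*% (y - p)              # score vector, equation (4)
    W <- as.vector(p * (1 - p))           # Bernoulli variances on the diagonal
    hess <- -t(X) %*% (W * X)             # second derivative of the log-likelihood
    step <- solve(hess, grad)             # Newton step
    beta <- beta - step
    if (max(abs(step)) < tol) break       # stop when the update is negligible
  }
  drop(beta)
}

# Hypothetical usage: two correlated predictors and a noisy binary response.
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 + x1 - x2))
X  <- cbind(1, x1, x2)
newton_logistic(X, y)
coef(glm(y ~ x1 + x2, family = binomial))  # should agree closely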
2.2 Discriminant Analysis Methods
The most often applied discriminant analyses depend on the multivariate normal distribution, $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, whose probability density function is written as

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right), \qquad (7)$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_p)^{T}$ represents the independent variables, $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)^{T}$ represents the mean of the independent variables, $|\boldsymbol{\Sigma}|$ is the determinant of the covariance matrix, and $\boldsymbol{\Sigma}^{-1}$ is the inverse of the covariance matrix.
2.2.1 Linear Discriminant Analysis (LDA)
LDA assumes that the covariance matrices of the two classes in the binary classification are equal, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, [17]. The decision boundary is where the prior-weighted densities from (7) are equal:

$$\pi_1 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right) = \pi_2 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right), \qquad (8)$$
where $\pi_1$ and $\pi_2$ are the prior probabilities of the two classes, and $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the means of the two classes.
Taking the natural logarithm of (8) and simplifying gives

$$-\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \ln(\pi_1) = -\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln(\pi_2), \qquad (9)$$

where the quadratic term $-\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ cancels on both sides of (9). Multiplying both sides by two, we get

$$2(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_2+\boldsymbol{\mu}_1) + 2\ln\frac{\pi_2}{\pi_1} = 0. \qquad (10)$$
(10)
Thus, the (10) can be seen in the form of
T
A x+ b = 0
which is called the LDA. The
decision boundary discriminates the two classes by
2 1 2 1 2 1
2
1
11
ˆ ˆ ˆ ˆ ˆ ˆ
( ) 2
ˆ
2ln .
ˆ
ˆTT




xx
(11)
The class corresponds to assign an observation
x
where
1 , ( ) 0
() 2 , ( ) 0
if
xif
x
x
. (12)
The parameters in (12) are estimated by the sample means and covariance matrices:

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\mathbf{x}_i, \quad k=1,2, \qquad \hat{\boldsymbol{\Sigma}} = \frac{(n_1-1)\hat{\boldsymbol{\Sigma}}_1 + (n_2-1)\hat{\boldsymbol{\Sigma}}_2}{n-2}, \quad n = n_1 + n_2,$$

$$\hat{\boldsymbol{\Sigma}}_k = \frac{1}{n_k-1}\sum_{i=1}^{n_k}(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)^{T}, \qquad \hat{\pi}_k = \frac{n_k}{n},$$

where $\hat{\boldsymbol{\Sigma}}$ is called the pooled covariance matrix.
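The sketch below computes these estimates and the discriminant score of (11)-(12) directly in R for a two-class problem; the matrices X1 and X2 holding the observations of each class are hypothetical placeholders.

# LDA by hand: a minimal sketch for two classes, assuming X1 and X2 are
# numeric matrices whose rows are the observations of class 1 and class 2.
lda_boundary <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2); n <- n1 + n2
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n - 2)  # pooled covariance
  Sinv <- solve(S)
  a <- 2 * Sinv %*% (mu2 - mu1)                        # linear coefficients
  b <- -t(mu2 - mu1) %*% Sinv %*% (mu2 + mu1) + 2 * log((n2 / n) / (n1 / n))
  list(coef = drop(a), const = drop(b))                # delta(x) = coef'x + const
}

# Classify a new observation x: class 1 if delta(x) < 0, class 2 otherwise.
classify_lda <- function(bnd, x) {
  delta <- sum(bnd$coef * x) + bnd$const
  if (delta < 0) 1 else 2
}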
2.2.2 Quadratic Discriminant Analysis (QDA)
QDA requires the assumption of unequal covariance matrices, $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$, [17]. Therefore, the equality of multivariate normal densities in (8) can be rewritten with individual covariance matrices as

$$\pi_1 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_1|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right) = \pi_2 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_2|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right). \qquad (13)$$
Taking the natural logarithm of (13) for an instance $\mathbf{x}$ on both sides gives

$$\ln\pi_1 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_1| - \frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}_1^{-1}\mathbf{x} + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 = \ln\pi_2 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_2| - \frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}_2^{-1}\mathbf{x} + \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2. \qquad (14)$$

Multiplying both sides of (14) by two and collecting terms gives

$$\mathbf{x}^{T}(\boldsymbol{\Sigma}_1^{-1}-\boldsymbol{\Sigma}_2^{-1})\mathbf{x} + 2(\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}-\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1})\mathbf{x} - \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2 + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \ln\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_2|} + 2\ln\frac{\pi_2}{\pi_1} = 0. \qquad (15)$$
Equation (15) has the quadratic form $\mathbf{x}^{T}\mathbf{A}\mathbf{x} + \mathbf{b}^{T}\mathbf{x} + c = 0$, so the decision boundary between the two classes is expressed as

$$\delta(\mathbf{x}) = \mathbf{x}^{T}(\hat{\boldsymbol{\Sigma}}_1^{-1}-\hat{\boldsymbol{\Sigma}}_2^{-1})\mathbf{x} + 2(\hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2^{-1}-\hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1^{-1})\mathbf{x} - \hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2^{-1}\hat{\boldsymbol{\mu}}_2 + \hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1^{-1}\hat{\boldsymbol{\mu}}_1 + \ln\frac{|\hat{\boldsymbol{\Sigma}}_1|}{|\hat{\boldsymbol{\Sigma}}_2|} + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1}. \qquad (16)$$

The classification rule and the parameter estimates of QDA are defined as for linear discriminant analysis in the previous section.
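In practice, LDA and QDA need not be coded by hand: the MASS package in R provides lda() and qda(), which estimate the same class means, covariance matrices, and priors. A minimal sketch on a hypothetical data frame dat with predictors x1, x2 and a binary factor column y follows.

# Fitting LDA and QDA with the MASS package: a minimal sketch.
library(MASS)

fit_lda <- lda(y ~ x1 + x2, data = dat)   # pooled covariance matrix
fit_qda <- qda(y ~ x1 + x2, data = dat)   # one covariance matrix per class

# predict() returns the assigned class and the posterior probabilities.
pred_lda <- predict(fit_lda, dat)$class
pred_qda <- predict(fit_qda, dat)$class
mean(pred_lda == dat$y)                   # training accuracy, for illustration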
2.2.3 Regularized Discriminant Analysis (RDA)
LDA assumes that the covariance matrices of all classes are equal and computes the pooled covariance matrix ($\hat{\boldsymbol{\Sigma}}$). In contrast, QDA allows unequal covariance matrices and estimates an individual covariance matrix ($\hat{\boldsymbol{\Sigma}}_k$) for each class when classifying binary data sets. To improve the estimated covariance matrix, Friedman, [11], proposed a regularization method within discriminant analysis, called RDA. Specifically, the covariance matrix is approximated by

$$\hat{\boldsymbol{\Sigma}}_k(\lambda) = (1-\lambda)\hat{\boldsymbol{\Sigma}}_k + \lambda\hat{\boldsymbol{\Sigma}}, \qquad (17)$$

where $\lambda$ is a regularization parameter with $0 \le \lambda \le 1$. It controls the shrinkage of the individual covariance matrix toward the pooled covariance matrix. When $\lambda = 0$, there is no shrinkage and (17) reduces to QDA; when $\lambda = 1$, there is full shrinkage and (17) yields LDA. The second regularization parameter is $\gamma$, with $0 \le \gamma \le 1$, which controls shrinkage toward the identity matrix:

$$\hat{\boldsymbol{\Sigma}}_k(\lambda,\gamma) = (1-\gamma)\hat{\boldsymbol{\Sigma}}_k(\lambda) + \frac{\gamma}{p}\,\mathrm{tr}\!\left[\hat{\boldsymbol{\Sigma}}_k(\lambda)\right]\mathbf{I}, \qquad k=1,2, \qquad (18)$$

where $\mathrm{tr}(\cdot)$ is the trace of the covariance matrix, $\mathbf{I}$ is the identity matrix, and $\gamma$ is an additional regularization parameter. The decision boundary depends on the two regularization parameters $(\lambda,\gamma)$ through the class covariance matrices of (17) and (18) substituted into (16):

$$\delta(\mathbf{x}) = \mathbf{x}^{T}\!\left(\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}-\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}\right)\!\mathbf{x} + 2\left(\hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}-\hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}\right)\!\mathbf{x} - \hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}\hat{\boldsymbol{\mu}}_2 + \hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}\hat{\boldsymbol{\mu}}_1 + \ln\frac{|\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)|}{|\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)|} + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1}. \qquad (19)$$

Suitable estimates of the pair of regularization parameters ($\lambda$, $\gamma$) are determined by trying different combinations of $\lambda$ and $\gamma$. For each candidate pair, leave-one-out cross-validation, [10], is used to evaluate the class assignments obtained from the discriminant scores in (12) on the observed data.
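A minimal R sketch of the regularized covariance estimate of (17)-(18) is given below; the grid of (lambda, gamma) values is an illustrative assumption, and the klaR package's rda() offers a packaged alternative.

# Regularized covariance matrix of RDA, equations (17) and (18): a sketch.
# Sk is the class covariance matrix, S the pooled covariance matrix.
rda_cov <- function(Sk, S, lambda, gamma) {
  S_lam <- (1 - lambda) * Sk + lambda * S       # shrink toward pooled, (17)
  p <- nrow(S_lam)
  (1 - gamma) * S_lam +
    (gamma / p) * sum(diag(S_lam)) * diag(p)    # shrink toward identity, (18)
}

# A coarse grid of regularization pairs to be scored by cross-validation.
grid <- expand.grid(lambda = seq(0, 1, by = 0.25),
                    gamma  = seq(0, 1, by = 0.25))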
3 Simulation Data and Results
This study aimed to classify the binary response variable by logistic regression, LDA, QDA, and RDA. The independent variables ($\mathbf{x}$) were generated from the multivariate normal distribution with two, four, six, and eight independent variables and constant correlation ($\rho$) values of 0.1, 0.5, and 0.9. The multivariate normal density of the independent variables consists of the mean ($\boldsymbol{\mu}$) and the covariance matrix ($\boldsymbol{\Sigma}$):

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right),$$

where

$$\mathbf{x}_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{pi} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{pmatrix}, \qquad p = 2, 4, 6, 8, \quad i = 1, \ldots, n.$$

The mean ($\boldsymbol{\mu}$) was set to zero, and the standard deviation ($\sigma$) was set to 2 and 6, giving variances of 4 and 36. The parameter vector of the logit transformation was denoted $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^{T}$ for the two, four, six, and eight independent variables. Finally, the dependent variables ($y$) were calculated from the logit function

$$p(x_i) = \frac{e^{x_i^{T}\beta}}{1+e^{x_i^{T}\beta}}$$

of the logistic regression model. If $p(x_i) \ge 0.5$, the dependent variable was set to $y_i = 1$, and $y_i = 0$ when $p(x_i) < 0.5$. The R program was employed to simulate data with 1,000 replications for sample sizes of 200, 300, 400, and 500. The logistic regression, LDA, QDA, and RDA methods estimated decision boundary parameters to predict the binary dependent variables. The confusion matrix was used to assess the performance of each method in classification: the predicted data were compared with the actual data using the percentage of accuracy (Table 1).
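A condensed sketch of one replication of this design in R is shown below; it assumes the mvtnorm package for multivariate normal sampling, and the coefficient vector beta is an illustrative placeholder for the values used in the study.

# One replication of the simulation design: a minimal sketch.
library(mvtnorm)

n <- 200; p <- 4; rho <- 0.5; sigma2 <- 4
Sigma <- matrix(rho * sigma2, p, p); diag(Sigma) <- sigma2  # constant correlation
X <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)            # correlated predictors

beta <- rep(1, p + 1)                   # illustrative logit coefficients
eta  <- cbind(1, X) %*% beta            # linear predictor with intercept
y    <- as.integer(plogis(eta) >= 0.5)  # y = 1 when p(x) >= 0.5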
Table 1. The confusion matrix of actual data ($y_i$) and predicted data ($\hat{y}_i$).

Predicted data    | Actual data: $y_i = 1$  | Actual data: $y_i = 0$
$\hat{y}_i = 1$   | True Positive (TP)      | False Positive (FP)
$\hat{y}_i = 0$   | False Negative (FN)     | True Negative (TN)
$$\text{Percentage of Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100.$$
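In R, this accuracy can be read off a cross-tabulation of predictions against the truth; the sketch below assumes vectors y and y_hat of 0/1 labels.

# Percentage of accuracy from a confusion matrix: a minimal sketch.
accuracy_pct <- function(y, y_hat) {
  cm <- table(predicted = y_hat, actual = y)   # 2 x 2 confusion matrix
  100 * sum(diag(cm)) / sum(cm)                # (TP + TN) / total * 100
}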
The average percentages of accuracy of the four methods for two, four, six, and eight independent variables are shown in Tables 2-5. For each correlation coefficient (0.1, 0.5, and 0.9), data were generated from the multivariate normal distribution with 1,000 replications. In each case, the maximum average percentage of accuracy identifies the best-performing method.
Table 2. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 2 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1          100.00   98.08   98.23   98.28     100.00   98.15   98.18   98.29
           0.5          100.00   98.23   98.27   98.48     100.00   98.12   98.27   98.40
           0.9           99.99   98.19   98.24   98.49     100.00   98.15   98.22   98.45
300        0.1          100.00   98.45   98.47   98.62     100.00   98.47   98.55   98.61
           0.5          100.00   98.39   98.44   98.68     100.00   98.45   98.50   98.70
           0.9          100.00   98.36   98.45   98.68     100.00   98.37   98.45   98.63
400        0.1          100.00   98.55   98.62   98.72     100.00   98.58   98.66   98.75
           0.5           99.99   98.57   98.62   98.80      99.99   98.67   98.69   98.91
           0.9          100.00   98.58   98.63   98.81     100.00   98.60   98.66   98.84
500        0.1          100.00   98.77   98.84   98.94     100.00   98.75   98.78   98.90
           0.5          100.00   98.70   98.75   98.92     100.00   98.72   98.76   98.95
           0.9          100.00   98.74   98.77   98.93     100.00   98.73   98.71   98.92

Table 3. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 4 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.34   97.36   97.51   97.75      50.23   97.59   97.49   97.74
           0.5           50.31   97.58   97.50   97.89      50.09   97.60   97.60   97.86
           0.9           50.34   97.62   97.58   98.35      50.15   97.65   97.50   98.26
300        0.1           50.30   97.91   97.78   98.00      49.97   97.95   97.83   98.11
           0.5           50.10   97.97   97.85   98.17      50.21   97.97   97.81   98.25
           0.9           50.23   97.94   97.76   98.45      49.86   97.94   97.82   98.43
400        0.1           50.53   98.10   98.00   98.27      49.95   98.10   97.96   98.23
           0.5           50.18   98.07   97.95   98.33      49.85   98.10   97.94   98.29
           0.9           50.31   98.16   98.03   98.56      50.00   98.21   98.06   98.56
500        0.1           50.71   98.30   98.12   98.43      50.06   98.32   98.17   98.45
           0.5           50.08   98.30   98.13   98.48      50.11   98.31   98.16   98.50
           0.9           50.15   98.31   98.14   98.62      50.12   98.32   98.19   98.61
Table 4. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 6 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.03   97.37   97.10   97.48      50.35   97.27   96.98   97.37
           0.5           50.12   97.16   97.03   97.42      49.87   97.26   97.08   97.54
           0.9           50.12   97.27   97.98   97.73      50.08   97.30   97.18   97.78
300        0.1           50.46   97.61   97.28   97.76      49.88   97.70   97.29   97.78
           0.5           50.12   97.61   97.34   97.81      50.07   97.59   97.32   97.82
           0.9           50.29   97.56   97.27   97.93      49.82   97.61   97.23   97.97
400        0.1           50.28   97.79   97.46   97.88      50.10   97.88   97.46   97.95
           0.5           50.17   97.89   97.53   98.04      50.20   97.84   97.46   97.56
           0.9           50.12   97.81   97.49   98.09      49.99   97.85   97.51   98.11
500        0.1           50.49   97.92   97.66   98.04      50.10   98.02   97.67   98.15
           0.5           50.14   98.02   97.65   98.13      49.89   98.02   97.67   98.17
           0.9           50.37   97.98   97.65   98.26      49.97   98.03   97.69   98.28
Table 5. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 8 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.32   97.00   96.83   97.18      50.22   97.11   96.85   97.24
           0.5           50.20   97.05   96.88   97.32      50.16   97.04   96.81   97.25
           0.9           49.99   97.08   96.75   97.51      49.86   97.14   96.90   97.54
300        0.1           50.23   97.38   96.97   97.50      50.26   97.38   96.92   97.53
           0.5           50.00   97.31   96.94   97.53      49.98   97.28   96.92   97.53
           0.9           50.16   97.34   96.93   97.69      50.36   97.32   96.90   97.69
400        0.1           50.23   97.52   97.10   97.68      50.24   97.62   97.17   97.74
           0.5           50.27   97.52   97.08   97.70      50.01   97.54   97.10   97.74
           0.9           50.17   97.57   97.10   97.85      50.16   97.57   97.16   97.89
500        0.1           50.02   97.76   97.24   97.83      50.18   97.74   97.27   97.85
           0.5           50.16   97.75   97.23   97.88      50.28   97.74   97.31   97.89
           0.9           50.02   97.74   97.33   98.07      50.15   97.77   97.32   98.03
4 Application in Real Data
We applied four methods to classify liver and non-
liver patients from northeast Andhra Pradesh, India.
This data set was obtained from
https://archive.ics.uci.edu/ml/datasets/.
The independent variables were defined as the albumin and globulin ratio ($x_1$), total proteins ($x_2$), albumin ($x_3$), age ($x_4$), alkaline phosphatase ($x_5$), total bilirubin ($x_6$), direct bilirubin ($x_7$), alanine aminotransferase ($x_8$), and aspartate aminotransferase ($x_9$). The binary dependent variable contained 416 liver patient records and 167 non-liver patient records.
The Pearson correlation coefficient measures the strength of the relationship between two continuous variables. The formula can be written as:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$
The Pearson correlation coefficients of the nine independent variables are displayed in Table 6 and Fig. 1. Hypothesis testing used Student's t-distribution. The null and alternative hypotheses are defined as

$$H_0: \rho = 0, \qquad H_1: \rho \ne 0,$$

and the test statistic for significance is calculated by the formula

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

with $n-2$ degrees of freedom (df). If the absolute t-value is greater than the critical value, or the p-value is less than the significance level (0.05), the relationship is statistically significant, as shown in Table 6.
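In R, this test is available directly through cor.test(), so the t-statistic need not be computed by hand; the sketch below uses hypothetical vectors x and y.

# Pearson correlation and its t-test: a minimal sketch.
r_test <- cor.test(x, y, method = "pearson")   # H0: rho = 0
r_test$estimate    # sample correlation r
r_test$statistic   # t = r * sqrt(n - 2) / sqrt(1 - r^2)
r_test$p.value     # significant at the 0.05 level if below 0.05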
Table 6. Pearson correlation coefficients and their statistical significance for the nine independent variables (* significant at the 0.05 level).

Variables   x1      x2      x3       x4       x5       x6        x7         x8        x9
x1        1.000   0.234*  0.689*  -0.216*  -0.234*  -0.206*   -0.200*    -0.0023   -0.070
x2          -     1.000   0.783*  -0.186*  -0.027   -0.0079    0.000032  -0.042    -0.025
x3          -       -     1.000   -0.264*  -0.163*  -0.222*   -0.228*    -0.028    -0.084
x4          -       -       -      1.000    0.078    0.011     0.0067    -0.087    -0.020
x5          -       -       -        -      1.000    0.205*    0.234*     0.124*    0.166*
x6          -       -       -        -        -      1.000     0.874*     0.213*    0.237*
x7          -       -       -        -        -        -       1.000      0.233*    0.257*
x8          -       -       -        -        -        -         -        1.000     0.791*
x9          -       -       -        -        -        -         -          -       1.000
From Table 6, it can be seen that significant positive relationships at a strong level appear for the pairs $x_1$-$x_3$, $x_2$-$x_3$, $x_6$-$x_7$, and $x_8$-$x_9$. Negative relationships at a moderate level appear in most other cases, such as between $x_1$ and $x_4$ to $x_9$, and between $x_3$ and $x_4$ to $x_9$.
Fig. 1: The correlation plot of nine independent
variables.
The Pearson correlation coefficient matrix from Table 6 is visualized in Fig. 1, which is easily read using different colors. Dark blue and dark red illustrate high correlation, and light blue and light red denote low correlation. Several pairs of the nine independent variables are strongly colored, which means there is correlation among the nine independent variables, i.e., a multicollinearity problem. The logistic regression, LDA, QDA, and RDA methods were used to classify the data, and their percentages of accuracy are reported in Table 7. Subsets of two, four, six, eight, and all nine independent variables were formed to mirror the numbers of independent variables in the simulation study; the subsets of two, four, six, and eight variables were chosen among variables whose correlations were statistically significant at the 0.05 level.
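A condensed R sketch of this comparison is given below; it assumes the liver data have been loaded into a data frame liver with a factor column y coded "0"/"1" and predictor columns named x1 to x9 (the column names and the chosen subset are illustrative), and it uses the rda() function from the klaR package for RDA.

# Classifying the liver data with all four methods: a minimal sketch.
library(MASS)   # lda(), qda()
library(klaR)   # rda() with regularization parameters lambda and gamma

f <- y ~ x1 + x3                       # an illustrative two-variable subset
fit_lr  <- glm(f, data = liver, family = binomial)
fit_lda <- lda(f, data = liver)
fit_qda <- qda(f, data = liver)
fit_rda <- rda(f, data = liver)        # regularization estimated by the package

acc <- function(pred) 100 * mean(pred == liver$y)
acc(ifelse(predict(fit_lr, type = "response") >= 0.5, "1", "0"))
acc(predict(fit_lda, liver)$class)
acc(predict(fit_qda, liver)$class)
acc(predict(fit_rda, liver)$class)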
Table 7. The percentage of accuracy on two, four, six, eight, and nine independent variables.

Number of
Independent  Independent
Variables    Variables                             LR       LDA      QDA      RDA
2            x1, x3                              70.984   70.811   70.115   71.157
             x2, x3                              71.502   71.210   71.329   71.502
             x3, x4                              71.157   71.157   71.848   72.020
             x3, x5                              71.675   71.502   68.566   71.502
             x5, x8                              71.502   71.119   49.050   71.502
             x5, x9                              71.502   70.654   50.777   71.502
4            x1, x2, x4, x8                      72.020   70.293   52.504   71.502
             x1, x7, x8, x9                      71.115   71.020   53.713   71.502
             x5, x7, x8, x9                      71.502   70.587   54.404   71.502
             x3, x5, x8, x9                      72.193   71.329   49.395   71.502
6            x1, x3, x5, x6, x8, x9              71.848   71.502   54.922   71.502
             x1, x3, x5, x7, x8, x9              71.157   71.502   52.504   71.502
8            x1, x3, x4, x5, x6, x7, x8, x9      73.575   72.193   55.267   71.502
9            x1, x2, x3, x4, x5, x6, x7, x8, x9  73.575   71.848   55.440   71.102
Table 7 shows that logistic regression and RDA give the highest percentages of accuracy in most cases. The percentage of accuracy of the logistic regression method was good for all subsets of independent variables, and RDA outperformed or tied it in several cases. Since the correlation coefficients among the independent variables are not fixed, as they were in the simulation study, some results differ from the simulation study. As the number of independent variables increased, the accuracy percentage changed only slightly. Consequently, liver and non-liver patients can be classified using only a subset of the independent variables, saving the time and budget needed to collect a large number of independent variables.
5 Discussion
The simulation results are presented in Tables 2-5 as the average percentage of accuracy, depending on the number of independent variables and the sample size. As shown, the maximum accuracy was achieved by logistic regression with a small number of independent variables, while RDA stood out with a larger number of independent variables.
The power of classification also increased when the number of independent variables was small and the sample sizes were large. The correlation coefficient appears to have little effect, because the average percentage of accuracy varied only slightly across its values. Since the covariance pattern was unknown to the classifiers and the two classes were equiprobable, observations lay at short distances from the decision boundary. When the sample size increased, the accuracy of all methods increased in all cases.
From Table 7, the actual data results show that the logistic regression and RDA methods outperformed the others for all subsets of independent variables. The independent variables of the actual data are clearly skewed (Fig. 2), and the Shapiro-Wilk test, [18], confirmed that all independent variables are non-normal. Nevertheless, the logistic regression and RDA methods handled the classification with a large number of independent variables. The simulation data were generated from the multivariate normal distribution with several fixed correlations, whereas in the actual data it was difficult to control the distribution and the correlation coefficients as in the simulation study. Multicollinearity is a leading cause of bias in classification, [19], so the two sets of results differ on the actual data.
Meanwhile, the percentage of accuracy differs only slightly between the two methods. The logistic regression and RDA methods can reasonably classify liver and non-liver patients.
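The normality check mentioned above is a one-liner in R; the sketch below assumes the liver data frame from before and applies the Shapiro-Wilk test to each predictor column.

# Shapiro-Wilk normality test for each independent variable: a minimal sketch.
p_values <- sapply(liver[, paste0("x", 1:9)],
                   function(col) shapiro.test(col)$p.value)
p_values < 0.05   # TRUE indicates a significant departure from normality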
Fig. 2: The histogram of nine independent variables.
To illustrate the mathematical analysis of the LDA method on the actual data, a small example of liver and non-liver patient records is displayed in Table 8.
Table 8. The observed data of globulin ratio, albumin, and liver disease for 10 sample patients.

Patient   Globulin Ratio (x1)   Albumin (x3)   Liver Disease (Y)
1         1.0                   3.5            1
2         1.1                   3.6            1
3         1.2                   4.1            0
4         1.0                   3.4            1
5         0.8                   2.7            1
6         0.6                   3.0            1
7         0.9                   3.4            0
8         1.0                   4.1            1
9         0.87                  2.7            1
10        0.7                   2.3            0
The first step is to compute the mean vector and covariance matrix of each group:

$$\hat{\boldsymbol{\mu}}_1 = \begin{pmatrix} \frac{1+1.1+1+0.8+0.6+1+0.87}{7} \\[2pt] \frac{3.5+3.6+3.4+2.7+3+4.1+2.7}{7} \end{pmatrix} = \begin{pmatrix} 0.91 \\ 3.285 \end{pmatrix}, \qquad \hat{\boldsymbol{\mu}}_2 = \begin{pmatrix} \frac{1.2+0.9+0.7}{3} \\[2pt] \frac{4.1+3.4+2.3}{3} \end{pmatrix} = \begin{pmatrix} 0.933 \\ 3.266 \end{pmatrix},$$

$$\hat{\boldsymbol{\Sigma}}_1 = \begin{pmatrix} 0.0283 & 0.0565 \\ 0.0565 & 0.2647 \end{pmatrix}, \qquad \hat{\boldsymbol{\Sigma}}_2 = \begin{pmatrix} 0.0633 & 0.2216 \\ 0.2216 & 0.8233 \end{pmatrix}.$$

Next, the pooled covariance matrix ($\hat{\boldsymbol{\Sigma}}$) is

$$\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 0.0371 & 0.0977 \\ 0.0977 & 0.4044 \end{pmatrix},$$

and the inverse of the pooled covariance matrix is

$$\hat{\boldsymbol{\Sigma}}^{-1} = \begin{pmatrix} 74.290 & -17.964 \\ -17.964 & 6.816 \end{pmatrix}.$$

The prior probability of each group is

$$\hat{\pi}_1 = \frac{7}{10} = 0.7 \qquad \text{and} \qquad \hat{\pi}_2 = \frac{3}{10} = 0.3.$$

Finally, the LDA decision boundary is

$$\delta(\mathbf{x}) = 2(\hat{\boldsymbol{\mu}}_2-\hat{\boldsymbol{\mu}}_1)^{T}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{x} - (\hat{\boldsymbol{\mu}}_2-\hat{\boldsymbol{\mu}}_1)^{T}\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_2+\hat{\boldsymbol{\mu}}_1) + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1} \approx 4.15\,x_1 - 1.098\,x_3 - 1.6357.$$
The QDA and RDA boundaries can be approximated in the same way.
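This hand calculation can be checked numerically in R with a few lines; the sketch below re-uses the lda_boundary() helper defined in Section 2.2.1 on the ten records of Table 8. Small discrepancies from the rounded hand calculation are to be expected.

# Verifying the worked LDA example of Table 8: a minimal sketch.
x1 <- c(1, 1.1, 1.2, 1, 0.8, 0.6, 0.9, 1, 0.87, 0.7)
x3 <- c(3.5, 3.6, 4.1, 3.4, 2.7, 3, 3.4, 4.1, 2.7, 2.3)
y  <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0)

X1 <- cbind(x1, x3)[y == 1, ]   # liver patients (class 1)
X2 <- cbind(x1, x3)[y == 0, ]   # non-liver patients (class 2)
lda_boundary(X1, X2)            # coefficients and constant of delta(x)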
6 Conclusions
This paper describes the classification of binary data by applying the logistic regression, LDA, QDA, and RDA methods, and explains the benefits of each method. Given the empirical results of the simulation study, the logistic regression method performed best with a small number of independent variables, whereas with a large number of independent variables RDA gave the best classification performance.
Varying the correlation coefficient did not change the classification performance of any method, while larger sample sizes improved accuracy across all cases. The actual data were used to classify liver and non-liver patients based on nine independent variables, with subsets of two, four, six, and eight independent variables selected according to their significant correlations. These results showed that the logistic regression and RDA methods were effective at classification in most cases, even though the data were skewed. We therefore conclude that the logistic regression and RDA methods can classify data in the presence of multicollinearity. For future work, these methods can be applied to machine learning problems such as myoelectric control, [20], and feature extraction that captures global structure information, [21].
Acknowledgments:
This research is supported by King Mongkut’s
Institute of Technology Ladkrabang.
References:
[1] J. Lever, M. Krzywinski, N. Altman, Logistic regression: regression can be used on categorical responses to estimate probabilities and to classify, Nature Methods, Vol. 13, No. 7, 2016, pp. 541-542.
[2] A. Arabameri, H. R. Pourghasemi, Spatial modeling of gully erosion using linear and quadratic discriminant analyses in GIS and R, in: Spatial Modeling in GIS and R for Earth and Environmental Sciences, Elsevier, 2019, pp. 299-321.
[3] A. Sharma, K. K. Paliwal, Linear discriminant analysis for the small sample size problem: an overview, International Journal of Machine Learning and Cybernetics, Vol. 6, 2015, pp. 443-454.
[4] Y. Guo, T. Hastie, R. Tibshirani, Regularized linear discriminant analysis and its application in microarrays, Biostatistics, Vol. 8, No. 1, 2007, pp. 86-100.
[5] M. A. Fernandez, C. Rueda, B. Salvador, Incorporating additional information to normal linear discriminant rules, Journal of the American Statistical Association, Vol. 101, No. 474, 2006, pp. 569-577.
[6] A. Tharwat, T. Gaber, A. Ibrahim, A. E. Hassanien, Linear discriminant analysis: A detailed tutorial, AI Communications, Vol. 30, No. 2, 2017, pp. 169-190.
[7] F. Zhu, J. Gao, J. Yang, N. Ye, Neighborhood linear discriminant analysis, Pattern Recognition, Vol. 123, 2022, 108422.
[8] S. Dudoit, J. Fridlyand, T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, 2002, pp. 77-87.
[9] R. J. Rossi, Mathematical Statistics: An Introduction to Likelihood Based Inference, John Wiley & Sons, New York, 2018.
[10] A. Tharwat, Linear and quadratic discriminant analysis classifier: a tutorial, International Journal of Applied Pattern Recognition, Vol. 3, No. 2, 2016, pp. 145-180.
[11] J. H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association, Vol. 84, 1989, pp. 165-175.
[12] I. Pima, M. Aladjem, Regularized discriminant analysis for face recognition, Pattern Recognition, Vol. 37, No. 9, 2004, pp. 1945-1948.
[13] K. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, M. S. Alouini, A large dimensional study of regularized discriminant analysis, IEEE Transactions on Signal Processing, Vol. 68, 2020, pp. 2464-2479.
[14] A. Ciampi, J. Courteau, T. Niyonsenga, M. Xhignesse, L. Cacan, M. Roy, Family history and the risk of coronary heart disease: comparing predictive models, European Journal of Epidemiology, Vol. 17, No. 7, 2001, pp. 609-620.
[15] C. Ngufor, J. Wojtusiak, Extreme logistic regression, Advances in Data Analysis and Classification, Vol. 10, 2016, pp. 27-52.
[16] S. Akram, Q. U. Ann, Newton Raphson method, International Journal of Scientific & Engineering Research, Vol. 6, No. 7, 2015, pp. 1748-1752.
[17] B. Ghojogh, M. Crowley, Linear and quadratic discriminant analysis: Tutorial, arXiv preprint arXiv:1906.02590, 2019.
[18] Z. Hanusz, J. Tarasinska, W. Zielinski, Shapiro-Wilk test with known mean, REVSTAT-Statistical Journal, Vol. 14, No. 1, 2016, pp. 89-100.
[19] C. J. Lee, C. S. Park, J. S. Kim, J. G. Baek, A study on improving classification performance for manufacturing process data with multicollinearity and imbalanced distribution, Journal of Korean Institute of Industrial Engineers, Vol. 41, No. 1, 2015, pp. 25-33.
[20] J. M. Hahne, F. Biessmann, N. Jiang, H. Rehbaum, D. Farina, F. C. Meinecke, L. C. Parra, Linear and nonlinear regression techniques for simultaneous and proportional myoelectric control, IEEE Transactions on Neural Systems and Rehabilitation Engineering, Vol. 22, No. 2, 2014, pp. 269-279.
[21] D. Zhang, Y. Zhao, M. Du, A novel supervised feature extraction algorithm: enhanced within-class linear discriminant analysis, Computational Science and Engineering, Vol. 13, No. 1, 2016, pp. 13-23.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The author contributed in the present research, at all
stages from the formulation of the problem to the
final findings and solution.
Sources of Funding for Research Presented in a Scientific Article or Scientific Article Itself
This research is supported by King Mongkut's Institute of Technology Ladkrabang.
Conflict of Interest
The author has no conflict of interest to declare that
is relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US