Comparison of Logistic Regression and Discriminant Analysis for
Classification of Multicollinearity Data
AUTCHA ARAVEEPORN
Department of Statistics, School of Science,
King Mongkut's Institute of Technology Ladkrabang,
Bangkok, 10520,
THAILAND
Abstract: - The objective of this study is to compare the classification performance of logistic regression and discriminant analysis on a simulated dataset and on actual data on liver patients. Both datasets have a binary dependent variable that depends on correlated independent variables, so-called multicollinearity data. The standard classification method is logistic regression, which uses the probability from the logit function to model the dichotomous dependent variable. An iterative process estimates the logit function's parameters and explains the relationship between the binary dependent variable and the independent variables. Discriminant analysis is a powerful classification approach comprising linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and regularized discriminant analysis (RDA). These methods obtain decision boundaries by building a classifier model on the multivariate normal distribution. LDA assumes a common covariance matrix, whereas QDA allows an individual covariance matrix for each class. RDA extends QDA by introducing regularization parameters into the estimate of the covariance matrix. In the simulation study, the independent variables are generated from a multivariate normal distribution with a constant correlation, which creates the multicollinearity problem; the binary response variable is then obtained from the logit function. For the application to actual data, we classified liver and non-liver patients as the dependent variable using nine independent variables from the patients' personal and clinical records. The highest average percentage of accuracy determines the performance of these methods. The results show that logistic regression was successful with a small number of independent variables, but RDA performed better with a large number of independent variables.
Key-Words: - linear discriminant analysis, quadratic discriminant analysis, regularized discriminant analysis
Received: October 19, 2022. Revised: December 17, 2022. Accepted: January 15, 2023. Published: February 16, 2023.
1 Introduction
Regression analysis is concerned with describing the relationship between a dependent variable and one or more independent variables. The dependent variable is usually continuous, but sometimes the outcome variable is discrete; logistic regression was developed to handle this case.
Logistic regression is an effective statistical technique widely used to find the best-fitting model for biological and medical data. The logistic regression model differs from the linear regression model in that the outcome variable is binary or dichotomous; consequently, the two models also differ in their parameterization and assumptions. Lever et al., [1], showed that logistic regression is a powerful tool for predicting class membership by probability and for classifying a binary dependent variable.
One assumption of logistic regression is that there are no multicollinearity problems between the independent variables. It is often the case, however, that the independent variables are correlated, which can lead to misleading conclusions. The importance of multicollinearity data was studied in [2] for classifying gully erosion by comparing discriminant analyses.
The basic idea of discriminant analysis is to separate two or more groups of observed data by creating decision boundaries, in linear or quadratic form, for classification. Discriminant analysis involves a categorical dependent variable and independent variables that follow a multivariate normal distribution with equal or unequal covariance matrices. Classification by discriminant analysis was developed further for small sample sizes, [3], and with new theory, [4]. An illustration of discriminant analysis on medical data of patients who had suffered a heart attack, [5], classified whether a patient would survive or not survive
based on the independent variables. Discriminant analysis is commonly divided into linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
LDA is a classification and dimensionality-reduction technique that can be explained from two perspectives: the first is a probabilistic interpretation, and the second is based on constructing a discriminant projection. Tharwat et al., [6], studied the workings of linear discriminant analysis in different applications and concluded that LDA is robust in classification accuracy. Zhu et al., [7], noted that LDA assumes all data are independently and identically distributed and proposed the neighborhood linear discriminant for when this assumption does not hold. Dudoit et al., [8], compared the performance of discriminant methods for tumor classification using gene expression data. LDA assumes an equal covariance matrix for all classes, and the decision boundary is calculated as a linear function.
In particular, when each class has an individual covariance matrix, this leads to the so-called QDA, and the decision boundary becomes a quadratic function. Under the multivariate normal distribution, with a mean and covariance matrix assumed for each class, the parameters of the decision boundary are produced by the maximum likelihood method, [9]. Tharwat, [10], collected the essential background of LDA and QDA in different classification applications; the discriminant functions and decision boundaries were highlighted with numerical illustrations.
LDA and QDA were improved by regularizing the individual covariance matrices. Friedman, [11], proposed a regularization parameter to control the shrinkage of the covariance matrix, called regularized discriminant analysis (RDA). Pima and Aladjem, [12], studied RDA in face recognition and checked the sensitivity of RDA to different methods of photometric preprocessing. Elkhalil et al., [13], conducted a large-dimensional study of RDA classifiers with its two popular variants, regularized LDA and regularized QDA.
This research aims to investigate classification by logistic regression, LDA, QDA, and RDA. For this purpose, we generate independent variables with a multicollinearity problem from a multivariate normal distribution and obtain binary dependent variables conditional on them. The methods are then applied to actual data to classify liver and non-liver patients from northeast Andhra Pradesh, India, with nine independent variables. The percentage of accuracy determines the performance of the four methods.
2 Classification Methods
Logistic regression and discriminant analysis are the
main methods to classify multicollinearity data.
2.1 Logistic Regression Method
The logistic regression model is used when the dependent variable ($Y$) is binary or dichotomous and depends on the independent variables ($X$). This methodology is used to study medical conditions; for example, it is helpful in predicting the presence or absence of evidence of coronary heart disease, [14]. The independent variables can be continuous or categorical.
To understand the construction of the logistic regression model, start with the conditional probability of the dependent variable given the independent variable, denoted $P(Y \mid X)$. The class of $Y$ is denoted "1" for success and "0" for failure, so $Y$ is a binary variable following the Bernoulli distribution. One can verify that $P(Y=1) = E(Y)$, and the conditional probability is $P(Y=1 \mid X=x) = E(Y \mid X=x)$. Assume that $P(Y=1 \mid X=x) = p(x)$ and $P(Y=0 \mid X=x) = 1 - p(x)$. The likelihood function is written as

$$P(Y_i = y_i \mid X_i = x_i) = \prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1-y_i}. \qquad (1)$$
The basic idea is to let $p(x)$ be a linear function, but a linear function is unbounded while $0 \le p(x) \le 1$. The next idea is to let $\log p(x)$ be a linear function, but the logarithm is still bounded on one side. Finally, the logit transformation $\log \frac{p(x)}{1-p(x)}$ is a linear function without boundedness problems. This last choice is called logistic regression. Formally, the logistic regression model is

$$\log \frac{p(x_i)}{1-p(x_i)} = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}, \qquad (2)$$

where $\beta$ is the vector of coefficients of the logit transformation, $x_i$ is the set of independent variables, $k$ is the number of independent variables, and $n$ is the number of observations.
From (1), the likelihood function is then

$$l(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1-y_i}. \qquad (3)$$
Taking logarithms, the log-likelihood from (3) turns into a summation:

$$\begin{aligned}
\ln l(\beta) &= \sum_{i=1}^{n}\left[ y_i \log p(x_i) + (1-y_i)\log(1-p(x_i)) \right] \\
&= \sum_{i=1}^{n}\left[ y_i \log\frac{p(x_i)}{1-p(x_i)} + \log(1-p(x_i)) \right] \\
&= \sum_{i=1}^{n}\left[ y_i\,x_i^{T}\beta - \log\!\left(1+e^{x_i^{T}\beta}\right) \right].
\end{aligned}$$
The maximum likelihood estimate is obtained by differentiating the log-likelihood with respect to the parameter:

$$\frac{\partial \ln l(\beta)}{\partial \beta} = \sum_{i=1}^{n}\left( y_i - \frac{e^{x_i^{T}\beta}}{1+e^{x_i^{T}\beta}} \right) x_i = \sum_{i=1}^{n}\left( y_i - p(x_i;\beta) \right) x_i. \qquad (4)$$
Setting (4) to zero gives no closed-form solution, so the maximum likelihood estimator cannot be obtained directly from this formula. The value of $\beta$ that maximizes $\log l(\beta)$ is instead found by iterative techniques, [15]. The Newton-Raphson method, studied by Akram and Ann, [16], performs well for fitting the logistic regression model. In its simplest case, the Newton-Raphson method minimizes a function $f(\beta)$ of one scalar variable to find the global minimum $\beta^{*}$. Assume that $f(\beta)$ is a smooth function and $\beta^{*}$ is a regular interior minimum. Near the minimum, $f$ can be approximated by a Taylor expansion:

$$f(\beta) \approx f(\beta^{*}) + \frac{1}{2}(\beta-\beta^{*})^{2}\left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{*}}, \qquad (5)$$

where the first-order term vanishes because the derivative is zero at an interior minimum, so $f(\beta)$ is close to quadratic near the minimum. Newton's method minimizes this quadratic approximation. Guess an initial point $\beta^{(0)}$ and take a second-order Taylor expansion around $\beta^{(0)}$:

$$f(\beta) \approx f(\beta^{(0)}) + (\beta-\beta^{(0)})\left.\frac{df}{d\beta}\right|_{\beta=\beta^{(0)}} + \frac{1}{2}(\beta-\beta^{(0)})^{2}\left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{(0)}}. \qquad (6)$$

Abbreviate the derivatives as $f'(\beta^{(0)}) = \left.\frac{df}{d\beta}\right|_{\beta=\beta^{(0)}}$ and $f''(\beta^{(0)}) = \left.\frac{d^{2}f}{d\beta^{2}}\right|_{\beta=\beta^{(0)}}$. Taking the derivative of (6) with respect to $\beta$, setting it equal to zero, and calling the solution $\beta^{(1)}$ gives

$$0 = f'(\beta^{(0)}) + \frac{1}{2} f''(\beta^{(0)})\,2(\beta^{(1)}-\beta^{(0)}), \qquad \beta^{(1)} = \beta^{(0)} - \frac{f'(\beta^{(0)})}{f''(\beta^{(0)})}.$$

The approximation is refined by iterating this one-step minimization:

$$\beta^{(n+1)} = \beta^{(n)} - \frac{f'(\beta^{(n)})}{f''(\beta^{(n)})}.$$
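As a concrete illustration, the sketch below implements this Newton-Raphson update (equivalently, iteratively reweighted least squares) for the multivariate coefficient vector in R, the language used for the experiments in Section 3. The simulated data and variable names are hypothetical, and in practice the built-in glm() with family = binomial performs the same fit.

# Newton-Raphson fit of a logistic regression model: a minimal sketch.
# X is the n x (k+1) design matrix (first column of 1s), y the 0/1 response.
newton_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                 # starting point beta^(0)
  for (iter in 1:max_iter) {
    p <- 1 / (1 + exp(-X %*% beta))       # p(x_i; beta), the logit probabilities
    grad <- t(X) %*% (y - p)              # score vector, equation (4)
    W <- as.vector(p * (1 - p))           # Bernoulli variances on the diagonal
    hess <- -t(X) %*% (W * X)             # second derivative of the log-likelihood
    step <- solve(hess, grad)             # Newton step
    beta <- beta - step
    if (max(abs(step)) < tol) break       # stop when the update is negligible
  }
  drop(beta)
}

# Hypothetical usage: two correlated predictors and a noisy binary response.
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 + x1 - x2))
X  <- cbind(1, x1, x2)
newton_logistic(X, y)
coef(glm(y ~ x1 + x2, family = binomial))  # should agree closely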
2.2 Discriminant Analysis Methods
The most often applied discriminant analyses depend on the multivariate normal distribution, $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, whose probability density function is written as

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right), \qquad (7)$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_p)^{T}$ represents the independent variables, $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)^{T}$ represents the mean of the independent variables, $|\boldsymbol{\Sigma}|$ is the determinant of the covariance matrix, and $\boldsymbol{\Sigma}^{-1}$ is the inverse of the covariance matrix.
2.2.1 Linear Discriminant Analysis (LDA)
LDA assumes that the covariance matrices of the two classes in the binary classification are equal, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, [17]. The decision boundary is where the prior-weighted densities from (7) are equal:

$$\pi_1 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right) = \pi_2 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right), \qquad (8)$$
where $\pi_1$ and $\pi_2$ are the prior probabilities of the two classes, and $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the means of the two classes.
Taking the natural logarithm of (8) and simplifying gives

$$-\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \ln(\pi_1) = -\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln(\pi_2), \qquad (9)$$

where the quadratic term $-\frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ cancels on both sides of (9). Multiplying both sides by two, we get

$$2(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_2+\boldsymbol{\mu}_1) + 2\ln\frac{\pi_2}{\pi_1} = 0. \qquad (10)$$
(10)
Thus, the (10) can be seen in the form of
T
A x+ b = 0
which is called the LDA. The
decision boundary discriminates the two classes by
2 1 2 1 2 1
2
1
11
ˆ ˆ ˆ ˆ ˆ ˆ
( ) 2
ˆ
2ln .
ˆ
ˆTT




xx
(11)
The class corresponds to assign an observation
x
where
1 , ( ) 0
() 2 , ( ) 0
if
xif
x
x
. (12)
The parameters in (12) are estimated by the sample means and covariance matrices:

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\mathbf{x}_i, \quad k=1,2, \qquad \hat{\boldsymbol{\Sigma}} = \frac{(n_1-1)\hat{\boldsymbol{\Sigma}}_1 + (n_2-1)\hat{\boldsymbol{\Sigma}}_2}{n-2}, \quad n = n_1 + n_2,$$

$$\hat{\boldsymbol{\Sigma}}_k = \frac{1}{n_k-1}\sum_{i=1}^{n_k}(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)^{T}, \qquad \hat{\pi}_k = \frac{n_k}{n},$$

where $\hat{\boldsymbol{\Sigma}}$ is called the pooled covariance matrix.
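The sketch below computes these estimates and the discriminant score of (11)-(12) directly in R for a two-class problem; the matrices X1 and X2 holding the observations of each class are hypothetical placeholders.

# LDA by hand: a minimal sketch for two classes, assuming X1 and X2 are
# numeric matrices whose rows are the observations of class 1 and class 2.
lda_boundary <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2); n <- n1 + n2
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n - 2)  # pooled covariance
  Sinv <- solve(S)
  a <- 2 * Sinv %*% (mu2 - mu1)                        # linear coefficients
  b <- -t(mu2 - mu1) %*% Sinv %*% (mu2 + mu1) + 2 * log((n2 / n) / (n1 / n))
  list(coef = drop(a), const = drop(b))                # delta(x) = coef'x + const
}

# Classify a new observation x: class 1 if delta(x) < 0, class 2 otherwise.
classify_lda <- function(bnd, x) {
  delta <- sum(bnd$coef * x) + bnd$const
  if (delta < 0) 1 else 2
}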
2.2.2 Quadratic Discriminant Analysis (QDA)
QDA requires the assumption of unequal covariance matrices, $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$, [17]. Therefore, the equality of multivariate normal densities in (8) can be rewritten with individual covariance matrices as

$$\pi_1 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_1|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right) = \pi_2 \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_2|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right). \qquad (13)$$
Taking the natural logarithm of (13) for an instance $\mathbf{x}$ on both sides gives

$$\ln\pi_1 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_1| - \frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}_1^{-1}\mathbf{x} + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 = \ln\pi_2 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_2| - \frac{1}{2}\mathbf{x}^{T}\boldsymbol{\Sigma}_2^{-1}\mathbf{x} + \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2. \qquad (14)$$

Multiplying both sides of (14) by two and collecting terms gives

$$\mathbf{x}^{T}(\boldsymbol{\Sigma}_1^{-1}-\boldsymbol{\Sigma}_2^{-1})\mathbf{x} + 2(\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}-\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1})\mathbf{x} - \boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2 + \boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \ln\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_2|} + 2\ln\frac{\pi_2}{\pi_1} = 0. \qquad (15)$$
Equation (15) has the quadratic form $\mathbf{x}^{T}\mathbf{A}\mathbf{x} + \mathbf{b}^{T}\mathbf{x} + c = 0$, so the decision boundary between the two classes is expressed as

$$\delta(\mathbf{x}) = \mathbf{x}^{T}(\hat{\boldsymbol{\Sigma}}_1^{-1}-\hat{\boldsymbol{\Sigma}}_2^{-1})\mathbf{x} + 2(\hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2^{-1}-\hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1^{-1})\mathbf{x} - \hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2^{-1}\hat{\boldsymbol{\mu}}_2 + \hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1^{-1}\hat{\boldsymbol{\mu}}_1 + \ln\frac{|\hat{\boldsymbol{\Sigma}}_1|}{|\hat{\boldsymbol{\Sigma}}_2|} + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1}. \qquad (16)$$

The classification rule and the parameter estimates of QDA are defined as for linear discriminant analysis in the previous section.
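In practice, LDA and QDA need not be coded by hand: the MASS package in R provides lda() and qda(), which estimate the same class means, covariance matrices, and priors. A minimal sketch on a hypothetical data frame dat with predictors x1, x2 and a binary factor column y follows.

# Fitting LDA and QDA with the MASS package: a minimal sketch.
library(MASS)

fit_lda <- lda(y ~ x1 + x2, data = dat)   # pooled covariance matrix
fit_qda <- qda(y ~ x1 + x2, data = dat)   # one covariance matrix per class

# predict() returns the assigned class and the posterior probabilities.
pred_lda <- predict(fit_lda, dat)$class
pred_qda <- predict(fit_qda, dat)$class
mean(pred_lda == dat$y)                   # training accuracy, for illustration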
2.2.3 Regularized Discriminant Analysis (RDA)
LDA assumes that the covariance matrices of all classes are equal and computes the pooled covariance matrix ($\hat{\boldsymbol{\Sigma}}$). In contrast, QDA allows unequal covariance matrices and estimates an individual covariance matrix ($\hat{\boldsymbol{\Sigma}}_k$) for each class when classifying binary data sets. To improve the estimated covariance matrix, Friedman, [11], proposed a regularization method within discriminant analysis, called RDA. Specifically, the covariance matrix is approximated by

$$\hat{\boldsymbol{\Sigma}}_k(\lambda) = (1-\lambda)\hat{\boldsymbol{\Sigma}}_k + \lambda\hat{\boldsymbol{\Sigma}}, \qquad (17)$$

where $\lambda$ is a regularization parameter with $0 \le \lambda \le 1$. It controls the shrinkage of the individual covariance matrix toward the pooled covariance matrix. When $\lambda = 0$, there is no shrinkage and (17) reduces to QDA; when $\lambda = 1$, there is full shrinkage and (17) yields LDA. The second regularization parameter is $\gamma$, with $0 \le \gamma \le 1$, which controls shrinkage toward the identity matrix:

$$\hat{\boldsymbol{\Sigma}}_k(\lambda,\gamma) = (1-\gamma)\hat{\boldsymbol{\Sigma}}_k(\lambda) + \frac{\gamma}{p}\,\mathrm{tr}\!\left[\hat{\boldsymbol{\Sigma}}_k(\lambda)\right]\mathbf{I}, \qquad k=1,2, \qquad (18)$$

where $\mathrm{tr}(\cdot)$ is the trace of the covariance matrix, $\mathbf{I}$ is the identity matrix, and $\gamma$ is an additional regularization parameter. The decision boundary depends on the two regularization parameters $(\lambda,\gamma)$ through the class covariance matrices of (17) and (18) substituted into (16):

$$\delta(\mathbf{x}) = \mathbf{x}^{T}\!\left(\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}-\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}\right)\!\mathbf{x} + 2\left(\hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}-\hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}\right)\!\mathbf{x} - \hat{\boldsymbol{\mu}}_2^{T}\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)^{-1}\hat{\boldsymbol{\mu}}_2 + \hat{\boldsymbol{\mu}}_1^{T}\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)^{-1}\hat{\boldsymbol{\mu}}_1 + \ln\frac{|\hat{\boldsymbol{\Sigma}}_1(\lambda,\gamma)|}{|\hat{\boldsymbol{\Sigma}}_2(\lambda,\gamma)|} + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1}. \qquad (19)$$

Suitable estimates of the pair of regularization parameters ($\lambda$, $\gamma$) are determined by trying different combinations of $\lambda$ and $\gamma$. For each candidate pair, leave-one-out cross-validation, [10], is used to evaluate the class assignments obtained from the discriminant scores in (12) on the observed data.
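A minimal R sketch of the regularized covariance estimate of (17)-(18) is given below; the grid of (lambda, gamma) values is an illustrative assumption, and the klaR package's rda() offers a packaged alternative.

# Regularized covariance matrix of RDA, equations (17) and (18): a sketch.
# Sk is the class covariance matrix, S the pooled covariance matrix.
rda_cov <- function(Sk, S, lambda, gamma) {
  S_lam <- (1 - lambda) * Sk + lambda * S       # shrink toward pooled, (17)
  p <- nrow(S_lam)
  (1 - gamma) * S_lam +
    (gamma / p) * sum(diag(S_lam)) * diag(p)    # shrink toward identity, (18)
}

# A coarse grid of regularization pairs to be scored by cross-validation.
grid <- expand.grid(lambda = seq(0, 1, by = 0.25),
                    gamma  = seq(0, 1, by = 0.25))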
3 Simulation Data and Results
This study aimed to classify the binary response variable by logistic regression, LDA, QDA, and RDA. The independent variables ($\mathbf{x}$) were generated from the multivariate normal distribution with two, four, six, and eight independent variables and constant correlation ($\rho$) values of 0.1, 0.5, and 0.9. The multivariate normal density of the independent variables consists of the mean ($\boldsymbol{\mu}$) and the covariance matrix ($\boldsymbol{\Sigma}$):

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right),$$

where

$$\mathbf{x}_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{pi} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{pmatrix}, \qquad p = 2, 4, 6, 8, \quad i = 1, \ldots, n.$$

The mean ($\boldsymbol{\mu}$) was set to zero, and the standard deviation ($\sigma$) was set to 2 and 6, giving variances of 4 and 36. The parameter vector of the logit transformation was denoted $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^{T}$ for the two, four, six, and eight independent variables. Finally, the dependent variables ($y$) were calculated from the logit function

$$p(x_i) = \frac{e^{x_i^{T}\beta}}{1+e^{x_i^{T}\beta}}$$

of the logistic regression model. If $p(x_i) \ge 0.5$, the dependent variable was set to $y_i = 1$, and $y_i = 0$ when $p(x_i) < 0.5$. The R program was employed to simulate data with 1,000 replications for sample sizes of 200, 300, 400, and 500. The logistic regression, LDA, QDA, and RDA methods estimated decision boundary parameters to predict the binary dependent variables. The confusion matrix was used to assess the performance of each method in classification: the predicted data were compared with the actual data using the percentage of accuracy (Table 1).
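A condensed sketch of one replication of this design in R is shown below; it assumes the mvtnorm package for multivariate normal sampling, and the coefficient vector beta is an illustrative placeholder for the values used in the study.

# One replication of the simulation design: a minimal sketch.
library(mvtnorm)

n <- 200; p <- 4; rho <- 0.5; sigma2 <- 4
Sigma <- matrix(rho * sigma2, p, p); diag(Sigma) <- sigma2  # constant correlation
X <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)            # correlated predictors

beta <- rep(1, p + 1)                   # illustrative logit coefficients
eta  <- cbind(1, X) %*% beta            # linear predictor with intercept
y    <- as.integer(plogis(eta) >= 0.5)  # y = 1 when p(x) >= 0.5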
Table 1. The confusion matrix of actual data ($y_i$) and predicted data ($\hat{y}_i$).

Predicted data    | Actual data: $y_i = 1$  | Actual data: $y_i = 0$
$\hat{y}_i = 1$   | True Positive (TP)      | False Positive (FP)
$\hat{y}_i = 0$   | False Negative (FN)     | True Negative (TN)
$$\text{Percentage of Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100.$$
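In R, this accuracy can be read off a cross-tabulation of predictions against the truth; the sketch below assumes vectors y and y_hat of 0/1 labels.

# Percentage of accuracy from a confusion matrix: a minimal sketch.
accuracy_pct <- function(y, y_hat) {
  cm <- table(predicted = y_hat, actual = y)   # 2 x 2 confusion matrix
  100 * sum(diag(cm)) / sum(cm)                # (TP + TN) / total * 100
}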
The average percentages of accuracy of the four methods for two, four, six, and eight independent variables are shown in Tables 2-5. For each correlation coefficient (0.1, 0.5, and 0.9), data were generated from the multivariate normal distribution with 1,000 replications. In each case, the maximum average percentage of accuracy identifies the best-performing method.
Table 2. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 2 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1          100.00   98.08   98.23   98.28     100.00   98.15   98.18   98.29
           0.5          100.00   98.23   98.27   98.48     100.00   98.12   98.27   98.40
           0.9           99.99   98.19   98.24   98.49     100.00   98.15   98.22   98.45
300        0.1          100.00   98.45   98.47   98.62     100.00   98.47   98.55   98.61
           0.5          100.00   98.39   98.44   98.68     100.00   98.45   98.50   98.70
           0.9          100.00   98.36   98.45   98.68     100.00   98.37   98.45   98.63
400        0.1          100.00   98.55   98.62   98.72     100.00   98.58   98.66   98.75
           0.5           99.99   98.57   98.62   98.80      99.99   98.67   98.69   98.91
           0.9          100.00   98.58   98.63   98.81     100.00   98.60   98.66   98.84
500        0.1          100.00   98.77   98.84   98.94     100.00   98.75   98.78   98.90
           0.5          100.00   98.70   98.75   98.92     100.00   98.72   98.76   98.95
           0.9          100.00   98.74   98.77   98.93     100.00   98.73   98.71   98.92

Table 3. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 4 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.34   97.36   97.51   97.75      50.23   97.59   97.49   97.74
           0.5           50.31   97.58   97.50   97.89      50.09   97.60   97.60   97.86
           0.9           50.34   97.62   97.58   98.35      50.15   97.65   97.50   98.26
300        0.1           50.30   97.91   97.78   98.00      49.97   97.95   97.83   98.11
           0.5           50.10   97.97   97.85   98.17      50.21   97.97   97.81   98.25
           0.9           50.23   97.94   97.76   98.45      49.86   97.94   97.82   98.43
400        0.1           50.53   98.10   98.00   98.27      49.95   98.10   97.96   98.23
           0.5           50.18   98.07   97.95   98.33      49.85   98.10   97.94   98.29
           0.9           50.31   98.16   98.03   98.56      50.00   98.21   98.06   98.56
500        0.1           50.71   98.30   98.12   98.43      50.06   98.32   98.17   98.45
           0.5           50.08   98.30   98.13   98.48      50.11   98.31   98.16   98.50
           0.9           50.15   98.31   98.14   98.62      50.12   98.32   98.19   98.61
Table 4. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 6 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.03   97.37   97.10   97.48      50.35   97.27   96.98   97.37
           0.5           50.12   97.16   97.03   97.42      49.87   97.26   97.08   97.54
           0.9           50.12   97.27   97.98   97.73      50.08   97.30   97.18   97.78
300        0.1           50.46   97.61   97.28   97.76      49.88   97.70   97.29   97.78
           0.5           50.12   97.61   97.34   97.81      50.07   97.59   97.32   97.82
           0.9           50.29   97.56   97.27   97.93      49.82   97.61   97.23   97.97
400        0.1           50.28   97.79   97.46   97.88      50.10   97.88   97.46   97.95
           0.5           50.17   97.89   97.53   98.04      50.20   97.84   97.46   97.56
           0.9           50.12   97.81   97.49   98.09      49.99   97.85   97.51   98.11
500        0.1           50.49   97.92   97.66   98.04      50.10   98.02   97.67   98.15
           0.5           50.14   98.02   97.65   98.13      49.89   98.02   97.67   98.17
           0.9           50.37   97.98   97.65   98.26      49.97   98.03   97.69   98.28
Table 5. The average percentage of accuracy of Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA) under 8 independent variables.

Sample     Correlation            Variance = 4                        Variance = 36
Sizes (n)  Coefficient    LR      LDA     QDA     RDA        LR      LDA     QDA     RDA
200        0.1           50.32   97.00   96.83   97.18      50.22   97.11   96.85   97.24
           0.5           50.20   97.05   96.88   97.32      50.16   97.04   96.81   97.25
           0.9           49.99   97.08   96.75   97.51      49.86   97.14   96.90   97.54
300        0.1           50.23   97.38   96.97   97.50      50.26   97.38   96.92   97.53
           0.5           50.00   97.31   96.94   97.53      49.98   97.28   96.92   97.53
           0.9           50.16   97.34   96.93   97.69      50.36   97.32   96.90   97.69
400        0.1           50.23   97.52   97.10   97.68      50.24   97.62   97.17   97.74
           0.5           50.27   97.52   97.08   97.70      50.01   97.54   97.10   97.74
           0.9           50.17   97.57   97.10   97.85      50.16   97.57   97.16   97.89
500        0.1           50.02   97.76   97.24   97.83      50.18   97.74   97.27   97.85
           0.5           50.16   97.75   97.23   97.88      50.28   97.74   97.31   97.89
           0.9           50.02   97.74   97.33   98.07      50.15   97.77   97.32   98.03
4 Application in Real Data
We applied four methods to classify liver and non-
liver patients from northeast Andhra Pradesh, India.
This data set was obtained from
https://archive.ics.uci.edu/ml/datasets/.
The independent variables were defined as the albumin and globulin ratio ($x_1$), total proteins ($x_2$), albumin ($x_3$), age ($x_4$), alkaline phosphatase ($x_5$), total bilirubin ($x_6$), direct bilirubin ($x_7$), alanine aminotransferase ($x_8$), and aspartate aminotransferase ($x_9$). The binary dependent variable contained 416 liver patient records and 167 non-liver patient records.
The Pearson correlation coefficient measures the strength of the relationship between two continuous variables. The formula can be written as:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$
The Pearson correlation coefficients of the nine independent variables are displayed in Table 6 and Fig. 1. Hypothesis testing used Student's t-distribution. The null and alternative hypotheses are defined as

$$H_0: \rho = 0, \qquad H_1: \rho \ne 0,$$

and the test statistic for significance is calculated by the formula

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

with $n-2$ degrees of freedom (df). If the absolute t-value is greater than the critical value, or the p-value is less than the significance level (0.05), the relationship is statistically significant, as shown in Table 6.
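In R, this test is available directly through cor.test(), so the t-statistic need not be computed by hand; the sketch below uses hypothetical vectors x and y.

# Pearson correlation and its t-test: a minimal sketch.
r_test <- cor.test(x, y, method = "pearson")   # H0: rho = 0
r_test$estimate    # sample correlation r
r_test$statistic   # t = r * sqrt(n - 2) / sqrt(1 - r^2)
r_test$p.value     # significant at the 0.05 level if below 0.05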
Table 6. Pearson correlation coefficients and their statistical significance for the nine independent variables (* significant at the 0.05 level).

Variables   x1      x2      x3       x4       x5       x6        x7         x8        x9
x1        1.000   0.234*  0.689*  -0.216*  -0.234*  -0.206*   -0.200*    -0.0023   -0.070
x2          -     1.000   0.783*  -0.186*  -0.027   -0.0079    0.000032  -0.042    -0.025
x3          -       -     1.000   -0.264*  -0.163*  -0.222*   -0.228*    -0.028    -0.084
x4          -       -       -      1.000    0.078    0.011     0.0067    -0.087    -0.020
x5          -       -       -        -      1.000    0.205*    0.234*     0.124*    0.166*
x6          -       -       -        -        -      1.000     0.874*     0.213*    0.237*
x7          -       -       -        -        -        -       1.000      0.233*    0.257*
x8          -       -       -        -        -        -         -        1.000     0.791*
x9          -       -       -        -        -        -         -          -       1.000
From Table 6, it can be seen that significant positive relationships at a strong level appear for the pairs $x_1$-$x_3$, $x_2$-$x_3$, $x_6$-$x_7$, and $x_8$-$x_9$. Negative relationships at a moderate level appear in most other cases, such as between $x_1$ and $x_4$ to $x_9$, and between $x_3$ and $x_4$ to $x_9$.
Fig. 1: The correlation plot of nine independent
variables.
The Pearson correlation coefficient matrix from Table 6 is visualized in Fig. 1, which is easily read using different colors. Dark blue and dark red illustrate high correlation, and light blue and light red denote low correlation. Several pairs of the nine independent variables are strongly colored, which means there is correlation among the nine independent variables, i.e., a multicollinearity problem. The logistic regression, LDA, QDA, and RDA methods were used to classify the data, and their percentages of accuracy are reported in Table 7. Subsets of two, four, six, eight, and all nine independent variables were formed to mirror the numbers of independent variables in the simulation study; the subsets of two, four, six, and eight variables were chosen among variables whose correlations were statistically significant at the 0.05 level.
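A condensed R sketch of this comparison is given below; it assumes the liver data have been loaded into a data frame liver with a factor column y coded "0"/"1" and predictor columns named x1 to x9 (the column names and the chosen subset are illustrative), and it uses the rda() function from the klaR package for RDA.

# Classifying the liver data with all four methods: a minimal sketch.
library(MASS)   # lda(), qda()
library(klaR)   # rda() with regularization parameters lambda and gamma

f <- y ~ x1 + x3                       # an illustrative two-variable subset
fit_lr  <- glm(f, data = liver, family = binomial)
fit_lda <- lda(f, data = liver)
fit_qda <- qda(f, data = liver)
fit_rda <- rda(f, data = liver)        # regularization estimated by the package

acc <- function(pred) 100 * mean(pred == liver$y)
acc(ifelse(predict(fit_lr, type = "response") >= 0.5, "1", "0"))
acc(predict(fit_lda, liver)$class)
acc(predict(fit_qda, liver)$class)
acc(predict(fit_rda, liver)$class)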
Table 7. The percentage of accuracy on two, four, six, eight, and nine independent variables.

Number of
Independent  Independent
Variables    Variables                             LR       LDA      QDA      RDA
2            x1, x3                              70.984   70.811   70.115   71.157
             x2, x3                              71.502   71.210   71.329   71.502
             x3, x4                              71.157   71.157   71.848   72.020
             x3, x5                              71.675   71.502   68.566   71.502
             x5, x8                              71.502   71.119   49.050   71.502
             x5, x9                              71.502   70.654   50.777   71.502
4            x1, x2, x4, x8                      72.020   70.293   52.504   71.502
             x1, x7, x8, x9                      71.115   71.020   53.713   71.502
             x5, x7, x8, x9                      71.502   70.587   54.404   71.502
             x3, x5, x8, x9                      72.193   71.329   49.395   71.502
6            x1, x3, x5, x6, x8, x9              71.848   71.502   54.922   71.502
             x1, x3, x5, x7, x8, x9              71.157   71.502   52.504   71.502
8            x1, x3, x4, x5, x6, x7, x8, x9      73.575   72.193   55.267   71.502
9            x1, x2, x3, x4, x5, x6, x7, x8, x9  73.575   71.848   55.440   71.102
Table 7 shows that logistic regression and RDA give the highest percentages of accuracy in most cases. The percentage of accuracy of the logistic regression method was good for all subsets of independent variables, and RDA outperformed or tied it in several cases. Since the correlation coefficients among the independent variables are not fixed, as they were in the simulation study, some results differ from the simulation study. As the number of independent variables increased, the accuracy percentage changed only slightly. Consequently, liver and non-liver patients can be classified using only a subset of the independent variables, saving the time and budget needed to collect a large number of independent variables.
5 Discussion
The simulation results are presented in Tables 2-5 as the average percentage of accuracy, depending on the number of independent variables and the sample size. As shown, the maximum accuracy was achieved by logistic regression with a small number of independent variables, while RDA stood out with a larger number of independent variables.
The power of classification also increased when the number of independent variables was small and the sample sizes were large. The correlation coefficient appears to have little effect, because the average percentage of accuracy varied only slightly across its values. Since the covariance pattern was unknown to the classifiers and the two classes were equiprobable, observations lay at short distances from the decision boundary. When the sample size increased, the accuracy of all methods increased in all cases.
From Table 7, the actual data results show that the logistic regression and RDA methods outperformed the others for all subsets of independent variables. The independent variables of the actual data are clearly skewed (Fig. 2), and the Shapiro-Wilk test, [18], confirmed that all independent variables are non-normal. Nevertheless, the logistic regression and RDA methods handled the classification with a large number of independent variables. The simulation data were generated from the multivariate normal distribution with several fixed correlations, whereas in the actual data it was difficult to control the distribution and the correlation coefficients as in the simulation study. Multicollinearity is a leading cause of bias in classification, [19], so the two sets of results differ on the actual data.
Meanwhile, the percentage of accuracy differs only slightly between the two methods. The logistic regression and RDA methods can reasonably classify liver and non-liver patients.
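The normality check mentioned above is a one-liner in R; the sketch below assumes the liver data frame from before and applies the Shapiro-Wilk test to each predictor column.

# Shapiro-Wilk normality test for each independent variable: a minimal sketch.
p_values <- sapply(liver[, paste0("x", 1:9)],
                   function(col) shapiro.test(col)$p.value)
p_values < 0.05   # TRUE indicates a significant departure from normality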
Fig. 2: The histogram of nine independent variables.
To illustrate the mathematical analysis of the LDA method on the actual data, a small example of liver and non-liver patient records is displayed in Table 8.
Table 8. The observed data of globulin ratio, albumin, and liver disease for 10 sample patients.

Patient   Globulin Ratio (x1)   Albumin (x3)   Liver Disease (Y)
1         1.0                   3.5            1
2         1.1                   3.6            1
3         1.2                   4.1            0
4         1.0                   3.4            1
5         0.8                   2.7            1
6         0.6                   3.0            1
7         0.9                   3.4            0
8         1.0                   4.1            1
9         0.87                  2.7            1
10        0.7                   2.3            0
The first step is to compute the mean vector and covariance matrix of each group:

$$\hat{\boldsymbol{\mu}}_1 = \begin{pmatrix} \frac{1+1.1+1+0.8+0.6+1+0.87}{7} \\[2pt] \frac{3.5+3.6+3.4+2.7+3+4.1+2.7}{7} \end{pmatrix} = \begin{pmatrix} 0.91 \\ 3.285 \end{pmatrix}, \qquad \hat{\boldsymbol{\mu}}_2 = \begin{pmatrix} \frac{1.2+0.9+0.7}{3} \\[2pt] \frac{4.1+3.4+2.3}{3} \end{pmatrix} = \begin{pmatrix} 0.933 \\ 3.266 \end{pmatrix},$$

$$\hat{\boldsymbol{\Sigma}}_1 = \begin{pmatrix} 0.0283 & 0.0565 \\ 0.0565 & 0.2647 \end{pmatrix}, \qquad \hat{\boldsymbol{\Sigma}}_2 = \begin{pmatrix} 0.0633 & 0.2216 \\ 0.2216 & 0.8233 \end{pmatrix}.$$

Next, the pooled covariance matrix ($\hat{\boldsymbol{\Sigma}}$) is

$$\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 0.0371 & 0.0977 \\ 0.0977 & 0.4044 \end{pmatrix},$$

and the inverse of the pooled covariance matrix is

$$\hat{\boldsymbol{\Sigma}}^{-1} = \begin{pmatrix} 74.290 & -17.964 \\ -17.964 & 6.816 \end{pmatrix}.$$

The prior probability of each group is

$$\hat{\pi}_1 = \frac{7}{10} = 0.7 \qquad \text{and} \qquad \hat{\pi}_2 = \frac{3}{10} = 0.3.$$

Finally, the LDA decision boundary is

$$\delta(\mathbf{x}) = 2(\hat{\boldsymbol{\mu}}_2-\hat{\boldsymbol{\mu}}_1)^{T}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{x} - (\hat{\boldsymbol{\mu}}_2-\hat{\boldsymbol{\mu}}_1)^{T}\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_2+\hat{\boldsymbol{\mu}}_1) + 2\ln\frac{\hat{\pi}_2}{\hat{\pi}_1} \approx 4.15\,x_1 - 1.098\,x_3 - 1.6357.$$
The QDA and RDA boundaries can be approximated in the same way.
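This hand calculation can be checked numerically in R with a few lines; the sketch below re-uses the lda_boundary() helper defined in Section 2.2.1 on the ten records of Table 8. Small discrepancies from the rounded hand calculation are to be expected.

# Verifying the worked LDA example of Table 8: a minimal sketch.
x1 <- c(1, 1.1, 1.2, 1, 0.8, 0.6, 0.9, 1, 0.87, 0.7)
x3 <- c(3.5, 3.6, 4.1, 3.4, 2.7, 3, 3.4, 4.1, 2.7, 2.3)
y  <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0)

X1 <- cbind(x1, x3)[y == 1, ]   # liver patients (class 1)
X2 <- cbind(x1, x3)[y == 0, ]   # non-liver patients (class 2)
lda_boundary(X1, X2)            # coefficients and constant of delta(x)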
6 Conclusions
This paper describes the classification of binary data by applying the logistic regression, LDA, QDA, and RDA methods, and explains the benefits of each method. Given the empirical results of the simulation study, the logistic regression method performed best with a small number of independent variables, whereas with a large number of independent variables RDA gave the best classification performance.
Varying the correlation coefficient did not change the classification performance of any method, while larger sample sizes improved accuracy across all cases. The actual data were used to classify liver and non-liver patients based on nine independent variables, with subsets of two, four, six, and eight independent variables selected according to their significant correlations. These results showed that the logistic regression and RDA methods were effective at classification in most cases, even though the data were skewed. We therefore conclude that the logistic regression and RDA methods can classify data in the presence of multicollinearity. For future work, these methods can be applied to machine learning problems such as myoelectric control, [20], and feature extraction that captures global structure information, [21].
Acknowledgments:
This research is supported by King Mongkut’s
Institute of Technology Ladkrabang.
References:
[1] J. Lever, M. Krzywinski, N. Altman, Logistic regression: regression can be used on categorical responses to estimate probabilities and to classify, Nature Methods, Vol. 13, No. 7, 2016, pp. 541-542.
[2] A. Arabameri, H. R. Pourghasemi, Spatial modeling of gully erosion using linear and quadratic discriminant analyses in GIS and R, in: Spatial Modeling in GIS and R for Earth and Environmental Sciences, Elsevier, 2019, pp. 299-321.
[3] A. Sharma, K. K. Paliwal, Linear discriminant analysis for the small sample size problem: an overview, International Journal of Machine Learning and Cybernetics, Vol. 6, 2015, pp. 443-454.
[4] Y. Guo, T. Hastie, R. Tibshirani, Regularized linear discriminant analysis and its application in microarrays, Biostatistics, Vol. 8, No. 1, 2007, pp. 86-100.
[5] M. A. Fernandez, C. Rueda, B. Salvador, Incorporating additional information to normal linear discriminant rules, Journal of the American Statistical Association, Vol. 101, No. 474, 2006, pp. 569-577.
[6] A. Tharwat, T. Gaber, A. Ibrahim, A. E. Hassanien, Linear discriminant analysis: A detailed tutorial, AI Communications, Vol. 30, No. 2, 2017, pp. 169-190.
[7] F. Zhu, J. Gao, J. Yang, N. Ye, Neighborhood linear discriminant analysis, Pattern Recognition, Vol. 123, 2022, 108422.
[8] S. Dudoit, J. Fridlyand, T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, 2002, pp. 77-87.
[9] R. J. Rossi, Mathematical Statistics: An Introduction to Likelihood Based Inference, John Wiley & Sons, New York, 2018.
[10] A. Tharwat, Linear and quadratic discriminant analysis classifier: a tutorial, International Journal of Applied Pattern Recognition, Vol. 3, No. 2, 2016, pp. 145-180.
[11] J. H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association, Vol. 84, 1989, pp. 165-175.
[12] I. Pima, M. Aladjem, Regularized discriminant analysis for face recognition, Pattern Recognition, Vol. 37, No. 9, 2004, pp. 1945-1948.
[13] K. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, M. S. Alouini, A large dimensional study of regularized discriminant analysis, IEEE Transactions on Signal Processing, Vol. 68, 2020, pp. 2464-2479.
[14] A. Ciampi, J. Courteau, T. Niyonsenga, M. Xhignesse, L. Cacan, M. Roy, Family history and the risk of coronary heart disease: comparing predictive models, European Journal of Epidemiology, Vol. 17, No. 7, 2001, pp. 609-620.
[15] C. Ngufor, J. Wojtusiak, Extreme logistic regression, Advances in Data Analysis and Classification, Vol. 10, 2016, pp. 27-52.
[16] S. Akram, Q. U. Ann, Newton Raphson method, International Journal of Scientific & Engineering Research, Vol. 6, No. 7, 2015, pp. 1748-1752.
[17] B. Ghojogh, M. Crowley, Linear and quadratic discriminant analysis: Tutorial, arXiv preprint arXiv:1906.02590, 2019.
[18] Z. Hanusz, J. Tarasinska, W. Zielinski, Shapiro-Wilk test with known mean, REVSTAT-Statistical Journal, Vol. 14, No. 1, 2016, pp. 89-100.
[19] C. J. Lee, C. S. Park, J. S. Kim, J. G. Baek, A study on improving classification performance for manufacturing process data with multicollinearity and imbalanced distribution, Journal of Korean Institute of Industrial Engineers, Vol. 41, No. 1, 2015, pp. 25-33.
[20] J. M. Hahne, F. Biessmann, N. Jiang, H. Rehbaum, D. Farina, F. C. Meinecke, L. C. Parra, Linear and nonlinear regression techniques for simultaneous and proportional myoelectric control, IEEE Transactions on Neural Systems and Rehabilitation Engineering, Vol. 22, No. 2, 2014, pp. 269-279.
[21] D. Zhang, Y. Zhao, M. Du, A novel supervised feature extraction algorithm: enhanced within-class linear discriminant analysis, Computational Science and Engineering, Vol. 13, No. 1, 2016, pp. 13-23.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The author contributed in the present research, at all
stages from the formulation of the problem to the
final findings and solution.
Sources of Funding for Research Presented in a Scientific Article or Scientific Article Itself
This research is supported by King Mongkut's Institute of Technology Ladkrabang.
Conflict of Interest
The author has no conflict of interest to declare that
is relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US