Cognitive States Classification Analysis
VIRGINIA VALCHEVA, OLGA GEORGIEVA
Faculty of Mathematics and Informatics,
Sofia University “St. Kliment Ohridski”,
BULGARIA
Abstract: - Alzheimer's disease is a chronic, prolonged, and irreversible neurodegenerative disease of unknown
cause. In recent years growing research interest assumes that by processing data of essential factors effective
models can be defined for recognizing and predicting the disease development. The present article aims to
propose classification models for the diagnosis of Alzheimer's disease cognitive states. For this aim medical
data of biomarkers and cognitive assessment data are used. The novelty of the paper is to explore both the
Amyloid/TAU/ Neurodegeneration framework and the biologically determined process of delay between the
brain impairment and visibility of its appearances by incorporating these concepts in the model development
procedure. The study explores the ability of three classifiers Random Forest, Extreme Gradient Boosting, and
Logistic Regression. Conclusion results have been done by comparison of the grouping abilities in different
data spaces. The practical result of the study is helping to determine medical examinations that give accurate
results for the diagnosis and prediction of the progression of the disease in possible earlier stages of the disease
development.
Key-Words: - Data Analysis, Machine Learning, Data Mining, Classification, Medical Data Analysis,
Alzheimer’s Disease.
Received: August 23, 2023. Revised: May 29, 2024. Accepted: July 16, 2024. Published: September 3, 2024.
1 Introduction
Alzheimer's disease (AD) is a chronic
neurodegenerative disease of unknown cause. The
disease is a severe, prolonged, and irreversible
condition that compromises social and professional
functioning. Various factors such as genetic burden,
lifestyle, and environment can contribute to its
appearance and development, [1], [2].
In recent years growing research interest
assumes that by processing data of essential factors
effective models can be defined for recognizing and
predicting disease development. The factor
dependence models can help professionals in
searching for unknown factors’ relationships and
disease knowledge. Having such models at earlier
disease stages can be helpful for developing
prevention strategies and help in managing the
problems of the sick, [3], [4], [5].
A large part of the investigations in this
direction are focused on the analysis of brain
Magnetic Resonance Imaging (MRI) as a good and
reliable data source about the presence of the
disease. A sophisticated statistical analysis
procedure was implemented on diffusion-weighted
MRI to detect changes in the white matter regions of
the brain, [6]. Based on logistic regression analysis
of genotype data it is concluded that Alzheimer’s
disease has a significant polygenic component,
which has predictive utility for the disease risk, [7].
In answering the aim for identification of the
dependency model a number of recent publications
show the applicability and benefit of machine
learning methods. The risk of Alzheimer's disease is
analyzed based on data from various demographic,
clinical examinations, and genetic factors, showing
that age, cognitive function assessments, and
specific biomarkers are important in the disease
diagnosis. Three different machine learning
approaches Support Vector Machine (SVM),
eXtreme gradient boosting of decision trees, and
Artificial neural network are used to identify blood
biomarkers used to improve the model predictivity
for incident dementia, [8]. Classification models that
analyze speech patterns detect early signs of
Alzheimer's disease by analyzing features such as
pauses, hesitation, and word-finding difficulties in
speech samples to predict the possibility of
Alzheimer's disease, [1]. The study [9] uses the
kernel combination method of SVM to discriminate
between AD or Mild cognitive impairment (MCI)
and healthy controls using three modalities of
biomarkers. Another study compares the different
performances of three machine learning algorithms -
Random Forest, Gradient Boosting, and eXtreme
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
409
Volume 21, 2024
Gradient Boosting algorithms using biomarkers of
MCI classified factors to predict MCI to AD
conversion. The highest accuracy was achieved
using neuropsychological and Alzheimer-related
biomarkers and cognitive tests, [10]. Authors of [11]
show that the SVM algorithm successfully separate
patients with AD from healthy aging subjects. It
concludes that a combination of MRI features and
demographics could predict AD with high accuracy.
Other studies rely on classification techniques to
recognize disease cognitive groups by dealing with
different data sets. Thus, several classifiers -
GaussianNB, Decision Tree, Random Forest,
XGBoost, Voting Classifier, and GradientBoost
have been explored to predict Alzheimer's disease
and demonstrate the potential of this approach, [3].
To train the models the authors use the Open Access
Series of Imaging Studies (OASIS) data set and
show the beneficial outcome with the voting
classifier. The research of [5] employs a convolution
NN for training and a Random Forest Classifier,
KNeighborsClassifier, XGBClassifier, and Logistic
Regression for testing and classification algorithms.
This study looks at how different types of machine
learning algorithms can be used to solve AD
diagnostic challenges using a range of imaging
modalities employed to diagnose Alzheimer’s
disease. Our recent investigation confirms the best
performance of three classifiers namely Random
Forest, Extreme Gradient Boosting, and Logistic
Regression for AD diagnosis, [12]. It could be
summarized that classification algorithms are
successful tools for the recognition and prediction of
Alzheimer's disease using different types of data,
including MRI images, EEG signals, and
biomarkers.
At the same time, there are still open questions
that can be solved by machine learning methods.
Thus, a deep and wide understanding of the existing
interdependence between disease factors, disease
symptoms, and appeared cognitive states as well as
the respective description models is still under
ongoing investigation purpose. In such aim, in
recent years, a growing consensus on the critical
importance of the timing of intervention and the
need to initiate antiamyloid treatment during the
presymptomatic stages of the disease has emerged,
[13].
The present paper aims to investigate
classification models for the recognition of
cognitive states of Alzheimer's disease. A novelty of
the paper is in exploring the concept of
Amyloid/Tau/Neurodegeneration (A/T/N)
framework improving the feature selection of the
classification model. In addition, the biologically
determined process of delay between the brain
impairment and visibility of its appearances is also
accounted for in the feature selection improving the
model accuracy. The study explores the ability of
three classifiers – Random Forest, Extreme Gradient
Boosting, and Logistic Regression, that have already
been proven to perform better than others for
cognitive impairment recognition. Conclusion
results have been done by comparison of the
grouping abilities in the different data spaces
formed. The practical result of the proposed
investigation is effective models for diagnosis and
prediction of the illness progression in possible
earlier stages of its development
2 Data Set
Data used in the preparation of this article were
obtained from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI) database, [14]. The
ADNI was launched in 2003 as a public-private
partnership, led by Principal Investigator Michael
W. Weiner, MD. The primary goal of ADNI has
been to test whether serial magnetic resonance
imaging (MRI), positron emission tomography
(PET), other biological markers, and clinical and
neuropsychological assessment can be combined to
measure the progression of mild cognitive
impairment (MCI) and early Alzheimer’s disease
database, [14]. ADNI provides open access data of a
wide range of clinical data collected over the years
that are related to Alzheimer's disease and its
inherent cognitive disorders. Nevertheless, ADNI
has been primarily initiated to research the disease
according to the brain image data as MRI, here our
focus is on three different types of Alzheimer's
examinations namely demographic, biomarker, and
cognitive data. The good reason for this search is
last medical investigations show that brain proteins
serve as biomarkers for the disease and in
combination with some demographic parameters
they are disease preconditions, [13], [15], [16]. On
the other hand, cognitive examinations are
commonly used in medical practice being a solid
base for the diagnosis and prediction of cognitive
impairments. These examinations are not invasive,
do not need special medical equipment, and are
easily applicable. The present study uses the
following data types.
The data of demographic parameters as
information on participants’ age (AGE),
gender (PTGENDER), and education
(PTEDUCAT) as well as genetic risk data as
Body Mass Index (BMI) and gene APOE4.
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
410
Volume 21, 2024
Biomarkers information based on
Cerebrospinal fluid and plasma analysis of the
proteins β-amyloid (ABETA), total tau
(TAU), and phospho-tau (PTAU), as well
blood examinations of fluorodeoxyglucse
(FDG) of glucose metabolism measure are
known as most significant markers that
indicate the disease presence.
Cognitive examinations of various
neuropsychological and neuropsychiatric tests
of specific questions and observations
estimate the cognitive functions of the
different domains - memory, visuospatial,
executive, and language. The most used is
MMSE for neurodegenerative assessment as a
commonly accepted test of cognitive function.
Clinical Dementia Rating Scale (CDRS) and
Clinical Dementia Rating Sum of Boxes
(CDRSB) are widely used for assessing the
severity of dementia in patients. Activities
Questionnaire (FAQ) measures the ability to
perform everyday activities. Alzheimer's
Disease Assessment Scale (ADAS) is a set of
tests that assess various aspects of cognitive
function such as memory, language, and
orientation. Data from the Long Delay Free
Recall Total (LDELTOTAL) test for the
memory and neuropsychological test Rey
Auditory Verbal Learning Test (RAVLT) are
also used as data for neurodegenerative
assessments.
Data on the clinical diagnosis of the cognitive
state - normal cognition (CN), mild cognitive
impairment (MCI), and Alzheimer's disease (AD),
are also provided in ADNI. At only first visit the
participants were diagnosed with five cognitive
states: MCI is distinguished as Early mild cognitive
impairment (EMCI) and Late mild cognitive
impairment (LMCI). Significant memory concern
(SMC) is a condition noted as a cognitive problem,
but not diagnosed as Alzheimer's. At their next
visits, the subjects from SMC are relegated to CN.
3 Methodology
The applied research approach follows a data
mining procedure consisting of the following
successive steps: data preprocessing, feature
selection, data classification, and result analysis.
The specificity of each stage and the particular
techniques applied are presented below.
3.1 Data Preprocessing
Despite the data described in the previous section
being a subset of ADNI data still preprocessing is
important to apply. The problem of missing
examination data and data of diagnosis is
accomplished by filtering to ensure a fully
processible data set. The remaining amount of data
for further processing varies within the data spaces
formed after the feature selection stage discussed in
the next subsection.
Data normalization and transformation of
categorical to numeric data are other tasks of the
preprocessing stage. Min-max normalization is
applied in order to solve the scaling problems.
Categorical data are diagnosis data and some
demographic data such as PTGENDER and
PTEDUC. Appropriately a respective numerical
value is written instead.
3.2 Feature Selection
The importance of this stage is determined by the
need to select significant attributes that form a data
space, where the cognitive groups could be well
separated. There is no full information about the
dependency between the features or their role in
determining the cognitive state. Due to the existing
diversity and amount of disease factors and
biomarkers a feature selection algorithm needs to be
applied to find features most relevant to the
classification task.
In this study, we extend the feature selection
investigations by forming and investigating different
feature spaces in seeking the most informative one.
First, we apply the standard approach to this task.
The feature selection algorithm SelectKBest selects
the best k features that are most informative for
predicting the target variable of disease diagnosis.
The evaluation function assesses the relevance of
each feature by calculating an ANOVA F-value, that
measures the linear dependency between the feature
and the target value in the classification task.
The disadvantage of this approach is that feature
selection is done according to the medical diagnosis.
In medical practice most of the diagnosis rely on
data of cognitive tests, which are not expensive and
not invasive examinations. These examinations do
not present the current brain impairment but the
disease appearance. In answering this problem, we
extend the feature search by adopting the
Amyloid/Tau/Neurodegeneration framework as a
valuable evidence of the biological state of AD,
[13], [15], [16]. Amyloid-beta (ABETA) is a protein
fragment that is produced naturally in the brain, but
in Alzheimer's disease, it tends to accumulate and
form plaques, that disrupt communication between
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
411
Volume 21, 2024
brain cells. Elevated level of ABETA is considered
one of the disease biomarkers. In Alzheimer's
disease, tau proteins, which play a crucial role in
stabilizing neuronal structures, undergo
modifications, such as phosphorylation.
Phosphorylated tau (PTAU) forms tangles inside
brain cells that disrupt normal neuronal function and
contribute to cognitive impairment. In [15] the
neurodegenerative status is estimated by MRI
analysis. However, often in the medical practice the
Neurodegenerative status is examined by combining
assessments of cognitive tests, [10], [17]. By taking
advantage of these results and trying to avoid the
expensive and difficult-to-apply examinations here
we adopt the cognitive data in the A/T/N
framework. Thus, the space formed by
ABETA/PTAU/Cognitive assessments is
investigated for being an informative data space of
cognitive group classification.
In forming the informative data space, we
explore as well other knowledge for Alzheimer's
disease. The dynamics of the disease, including the
asymptomatic period, proceed with the deposition of
the amyloid- peptide in the brain, triggering the so-
called "amyloid cascade", [13]. Obviously, the time
delay in the onset of the disease relative to the
asymptomatic accumulation of amyloid plaques
must be considered. To answer of this, we
investigate the classification abilities of space
formed by ABETA/PTAU/Cognitive assessments,
where the cognitive tests are done in late time then
biomarkers examinations. Figure 1 summarizes the
three approaches for forming the data spaces that are
further investigated for classification analysis.
Fig. 1: Strategies for data space definition
3.3 Classification
Our recent investigation [12] based on the
considered data set shows that three classifiers
among seven ones, covering at large the diversity of
the known classification approaches, are most
presented. The two of them - Random Forest (RF),
and Extreme Gradient Boosting (XGB) are based on
the decision tree classification concept but with
respective substantial improvement. RF is an
ensemble learning method of multiple decision trees
aggregating their predictions. XGB applies gradient
boosting algorithm. The third method is extended
version of Logistic Regression (LR) that deals with
multiclassification task of statistical estimation of
relationship between the features and the diagnosis
outcome. Here, those three classifiers are used to
solve the research aim.
Training the classifiers allow to learn the
patterns and relationships between selected features
and target diagnosis It is based on training dataset
that consists a part of the available data. Adjusting
hyperparameters of each classifier such as learning
rate or tuning optimize the model performance.
Cross-validation by StratifiedKFold algorithm is
applied in order to ensure reliable training avoiding
the imbalance of the data in the distinct classes. It
provides such that each split contains approximately
the same proportion of instance data of each class as
the full data set.
Each classification model has to be further
assessed for predicting ability by classification of
the test data data that are not used for training.
This proves the model's applicability to new data. In
order to form the test data, we took the next visits
data. Each classifier is run several times for
randomly generated and in an equal ratio of training
and test sets.
3.4 Accuracy Evaluation of the Classification
Models
Assessment of the performance of each trained
classification model is evaluated by metrics
Precision (P), Recall (R), F1 score, and Average
Accuracy (AA):
𝑃 = 𝑇𝑃/(𝑇𝑃 +𝐹𝑃) 
𝑅 = 𝑇𝑃/(𝑇𝑃 +𝐹𝑁) 
𝐹1 = 2 𝑃 𝑅/(𝑃 + 𝑅) 
𝐴𝐴 = (𝑇𝑁 +𝑇𝑃)/(𝑇𝑁 +𝐹𝑃 +𝑇𝑃 +𝐹𝑁), 
where the accuracy metrics are counted by the
number of true positive (TP), false positive (FP),
false negative (FN), and true negative (TN) cases.
The accuracy assessment results for the training
data runs and for the testing data runs are
respectively averaged. The best performed classifier
is considered in terms of all metrics and of both data
sets.
Area under the curve (AUC) is also used as an
accuracy measure. ROC AUC compares the relation
between the True Positive Rate and the False
Positive Rate. It typically includes the Precision rate
Feature
selection
strategy
Feature
selection
algorithm
A/T/N framework
with time
dependency
account
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
412
Volume 21, 2024
calculated by equation (1) on the ordinate and the
False Positive Rate (FPR), where FPR=1-P, on the
X-axis. In order to evaluate the accuracy of multi-
class classifiers the One-vs-the-Rest multiclass
strategy, also known as one-vs-all, is applied. It
consists of computing the ROC AUC curve for each
of the classes. The larger area under the ROC AUC
curve means better classification.
4 Data Analysis Results
Initially, amount of 2370 participants’ data was
examined at their first visit for all considered
features. The results of the preprocessing stage are
discussed in the frame of the investigations of the
respective data space.
4.1 Data Spaces
Following the considerations of the previous section
feature selection algorithm is applied for first-visit
data where diagnosis data were given for the three
cognitive states – CN, MCI, and AD. By setting k=5
of the SelectKBest algorithm implemented by
Python five features of cognitive assessments are
selected. They are cognitive test data of MMSE,
CDRSB, FAQ, ADAS13, and LDELTOTAL that
define the Feature Space A (FS_A). The selection of
only cognitive tests as significant features could be
explained with the applied selection function. It
finds attributes most correlated with the diagnosis.
Bearing in mind that cognitive tests are most used in
the practice for Alzheimer's diagnosis it could be
supposed that diagnoses are much correlated with
cognitive assessments.
The second data space to be investigated is
determined by the A/T/N framework. We consider
feature space defined by the proteins’ biomarkers
and some cognitive assessment of the first visit data.
MMSE cognitive test is one of most applicable
cognitive assessments. Thus, Feature space B
(FS_B) formed by ABETA, PTAU, and MMSE is
the second examined data space.
In order to set a data space by cognitive data
obtained in late time we first investigated which
time delay period is most appropriate for this aim. It
could be seen that the number of changes of
diagnosis is most often in 24-nd month (Table 1).
Data from 481 subjects at their 24-month visit are
used for investigation of the data space. In this
space, the late cognitive assessment values in
regards to the biomarkers values are taken. Thus,
Feature space C (FS_C) is formed by the first visit
data of ABETA, PTAU, and 24-th month
assessments of one of the cognitive tests MMSE,
CDRSB, FAQ noted as MMSE_24, CDRSB_24,
FAQ_24, respectively. The three tests have been
discovered as significant ones by the feature
selection algorithm.
4.2 Classification Analysis
The three classification models were trained for the
three defined data spaces. As at all visits except the
first one the participants were diagnosed in three
groups the classifiers were trained to distinguish the
three classes namely CN, MCI, and AD. The trained
classifiers were evaluated in regard to their ability to
classify the test data sets. The test data sets were
formed by the data of the 12-th month visit. As far
as some participants do not have examinations at
this visit the test set has been accordingly reduced
for each examined data space.
Table 1. Number of the changed diagnosis
Diagnoses changed
from CN to MCI
Diagnoses changed
from MCI to AD
Period
/months/
Number
of subjects
Period
/months/
Number
of subjects
6
12
6
46
12
9
12
72
24
23
18
36
36
10
24
75
48
12
36
48
72
10
48
29
108
16
72
10
120
7
108
20
Data space FS_A is defined by the estimations
of cognitive tests MMSE, CDRSB, FAQ, ADAS13,
LDELTOTAL. It consists data of from 2320
participants. They were divided into 2088 training
and 232 testing sets used for the training stage. The
trained classifiers were further applied to the test
data. Table 2 presents the result (rounded values) of
the accuracy metrics (1)-(4) obtained through the
three classifiers for both training and test sets. The
corresponding averaged metrics values are shown as
well. The maximal metrics values are given in bold.
According to the A/T/N framework, the
investigated data space is FS_B which is formed by
ABETA, PTAU, and MMSE examination data.
After filtering due to missing diagnosis and
examinations 1541 data remain for the training.
Data of 320 participants examined at the 12th month
visit serve as a test set. The accuracy metrics values
and respective their averaged values are presented in
Table 3.
We to pay a special attention to the third
discussed data space noted as FS_C. It is formed
according to the novelty concept of exploring both
the A/T/N framework and accounting for the
biologically determined process of delay between
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
413
Volume 21, 2024
brain impairment and visibility of its appearances.
Thus, the three classifiers were trained in the space
FS_C that have been formed by varying different
cognitive test data. Training results for each of the
interested classifiers in the spaces formed by
ABETA, PTAU data of the first visit and by
respectively: a) MMSE_24 having 1064 data; b)
CDRSB_24 (1054 data); c) FAQ_24 (1045 data)
and d) by two estimations CDRSB and CDRSB_24
(1054 data) are presented at Table 4.
Table 2. Accuracy metrics values of classifiers’
performance in FS_A data space
Classifier
P
R
F
1
A
A
Accuracy metrics values for the training
data set
RF
0
,944
0
,943
0
,943
0
,943
LR
0
,917
0
,916
0
,915
0
,916
XGB
0
,929
0
,928
0
,928
0
,928
Accuracy metrics values for the testing
data set
RF
0
,813
0
,833
0
,820
0
,818
LR
0
,818
0
,835
0
,824
0
,821
XGB
0
,814
0
,833
0
,821
0
,820
Average accuracy metrics values
RF
0
,878
0
,888
0
,882
0
,880
LR
0
,867
0
,875
0
,87
0
,868
XGB
0
,872
0
,881
0
,875
0
,874
Table 3. Accuracy metrics values of classifiers’
performance in FS_B data space
Classifier
P
R
F1
AA
Accuracy metrics values for the training data set
RF
0,611
0,605
0,605
0,605
LR
0,640
0,621
0,622
0,621
XGB
0,614
0,607
0,607
0,607
Accuracy metrics values for the testing data set
RF
0,607
0,603
0,603
0,601
LR
0,689
0,563
0,543
0,611
XGB
0,588
0,616
0,592
0,588
Average accuracy metrics values
RF
0,609
0,604
0,604
0,603
LR
0,665
0,592
0,583
0,616
XGB
0,601
0,611
0,599
0,597
Table 4. Accuracy metrics values of classifiers’
performance in FS_C data space
RF classification results
Data space
P
R
F1
AA
a) ABETA, PTAU,
MMSE_24
0,574
0,571
0,569
0,571
b) ABETA, PTAU,
FAQ_24
0,653
0,650
0,648
0,650
c) ABETA, PTAU,
CDRSB_24
0,828
0,8198
0,820
0,8198
d) ABETA, PTAU,
CDRSB, CDRSB_24
0,860
0,854
0,853
0,854
LR classification results
Data space
P
R
F1
AA
a) ABETA, PTAU,
MMSE_24
0,595
0,594
0,579
0,594
b) ABETA, PTAU,
FAQ_24
0,711
0,6996
0,686
0,6996
c) ABETA, PTAU,
CDRSB_24
0,813
0,806
0,802
0,806
d) ABETA, PTAU,
CDRSB, CDRSB_24
0,848
0,842
0,840
0,842
XGB classification results
Data space
P
R
F1
AA
a) ABETA, PTAU,
MMSE_24
0,575
0,567
0,567
0,567
b) ABETA, PTAU,
FAQ_24
0,653
0,647
0,647
0,647
c) ABETA, PTAU,
CDRSB_24
0,822
0,814
0,815
0,814
d) ABETA, PTAU,
CDRSB, CDRSB_24
0,846
0,8396
0,839
0,8396
5 Results Analysis and Discussion
Comparison concerning the classifier's ability to
distinguish the data space shows close partition
performance of the three investigated classifiers.
However, it should be underlined that Random
Forest outperforms the rest two classifiers having
accuracies over 0,94 for the training data set and the
highest averaged metrics values in FS_A data space
(Table 2). The Random Forest model is also best
performed in FS_C data space (Table 4). Logistic
Regression outperforms in the testing task of space
FS_A (Table 2) and in the training task of FS_B
(Table 3). However, testing and averaged accuracy
metrics of FS_B space do not show any favorite
performance.
The experimental results give information about
the classification and predictability characteristics of
the three examined feature spaces. It is a base to
draw conclusions about the applicability of the three
studied strategies for feature selection. The feature
space FS_A formed by the cognitive tests
examinations is the most informative one as it
outperforms the accuracy values that are over 0,8
for both training and testing sets (Table 2). However
again, it should be underlined that assessing only by
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
414
Volume 21, 2024
the cognitive tests means to diagnose the disease at
the time of its visible appearance and not at its early
stage.
On the other hand, the accuracy results of the
data space FS_C for data spaces c) and d) consisting
of CDRSB test as neurodegenerative assessment are
fully commensurable with those of FS_A as the
accuracy presented is also over 0,8 (Table 4). The
accuracy is maximized if two data of CDRSB taken
in different times examinations are used to form the
data space. This proves the vitality of the idea using
A/T/N framework with accounting for the delay of
the cognitive tests data with respect to the
biomarkers data to diagnosis and predicting
Alzheimer's disease.
The obtained results are confirmed also by
metrics of the Area under the curve. It is represented
for RF classification for the different FS_C data
spaces. Figure 2, Figure 3, Figure 4, Figure 5 and
Figure 6 show the entire accuracy (Micro-average)
and accuracy reached for each class (the three
classes 0, 1, and 2 are shown, respectively). The
areas of the curves in Figure 5 and Figure 6 are
maximal. The obtained results confirm the
preference of data space defined by biomarkers
ABETA and PTAU data and CDRSB test data
obtained 24 months after than biomarkers data.
Besides these achievements, it should be
emphasized the good classification ability of the
CDRSB test in comparison with the rest
investigated. This could be explained by its
properties as it tries to assess all aspects of the
cognitive impairment. The recommendation is that it
be used alone and not in the battery of cognitive
tests as usual, [18].
6 Conclusion
The study presents a supervised data mining
procedure for answering the important task of early-
stage recognition of Alzheimer's disease for the
need for planning and effective care. ADNI data is
applied as a reliable medical data set for the
classification analysis.
The practical result of the study is helping to
determine medical examinations that give accurate
results for the diagnosis and prediction of the
progression of the disease in possible earlier stages
of the disease development. For this, the feature
selection stage is deeply considered a crucial stage
in determining an informative data space where the
cognitive data groups could be reliably
distinguished. The vitality of the A/T/N framework
applied to form the data space is shown. In addition,
the novelty concept to improve the model accuracy
by accounting for the delay in the cognitive tests’
information with respect to the biomarkers data is
proved. It is shown that the data space formed by
the two important biomarkers ABETA and PTAU
and cognitive test CDRSB data obtained 24 months
after the biomarkers presents significant accuracy of
the cognitive group distinguishing.
The comparison analysis of three known and
well-performed classifiers Random Forest, Logistic
Regression, and XGBoost for being classification
models is investigated. The preferences of the
Random Forest classifier are shown.
Fig. 2: AUC-ROC curve of Random Forest
classifier applied to ABETA, PTAU, MMSE_24
data space
Fig. 3: AUC-ROC curve of Random Forest
classifier applied to ABETA, PTAU, FAQ_24 data
space
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
415
Volume 21, 2024
Fig. 4: AUC-ROC curve of Random Forest
classifier applied to ABETA, PTAU, CDRSB_24
data space
Fig. 5: AUC-ROC curve of Random Forest
classifier applied to ABETA, PTAU, FAQ_24,
CDRSB_24 data space
Fig. 6: AUC-ROC curve of Random Forest
classifier applied to ABETA, PTAU, CDRSB,
CDRSB_24 data space
Acknowledgement:
This research work has been supported by GATE
project, funded by the Horizon 2020
WIDESPREAD-2018-2020 TEAMING Phase 2
program under grant agreement no.857155 and by
the European Union-NextGenerationEU, through
the National Recovery and Resilience Plan of the
Republic of Bulgaria, project SUMMIT BG-RRP-
2.004-0008-C01 and funded by Science Fund of
Sofia University by project no. 80-10-137/2024.
References:
[1] Fraser, K. C., Meltzer, J. A., & Rudzicz, F.
(2016). Linguistic Features Identify
Alzheimer's Disease in Narrative Speech.
Journal of Alzheimer's disease: JAD, 49(2),
407–422. https://doi.org/10.3233/JAD-
150520.
[2] Hugo, J., & Ganguli, M. (2014). Dementia
and cognitive impairment: epidemiology,
diagnosis, and treatment. Clinics in geriatric
medicine, 30(3), 421–442.
https://doi.org/10.1016/j.cger.2014.04.001.
[3] Uddin, K. M. M., Alam, M. J., Jannat-E-
Anawar, Uddin, M. A., & Aryal, S. (2023). A
Novel Approach Utilizing Machine Learning
for the Early Diagnosis of Alzheimer's
Disease. Biomedical materials & devices
(New York, N.Y.), 1–17. Advance online
publication. https://doi.org/10.1007/s44174-
023-00078-9.
[4] Shrivastava, R.K., Singh, S.P., Kaur, G.
(2023). Machine Learning Models for
Alzheimer’s Disease Detection Using OASIS
Data. In: Koundal, D., Jain, D.K., Guo, Y.,
Ashour, A.S., Zaguia, A. (eds) Data Analysis
for Neurodegenerative Disorders. Cognitive
Technologies. Springer, Singapore, 111-126.
https://doi.org/10.1007/978-981-99-2154-6_6.
[5] Sentamilselvan, K., Swetha, J., Sujitha, M.,
Vigasini, R. (2022). Alzheimer’s Disease
Detection Using Machine Learning and Deep
Learning Algorithms. In: Abraham, A., et al.
Innovations in Bio-Inspired Computing and
Applications. IBICA 2021. Lecture Notes in
Networks and Systems, 419. Springer, Cham.
https://doi.org/10.1007/978-3-030-96299-
9_29.
[6] Zhang, Y., Schuff, N., Ching, C., Tosun, D.,
Zhan, W., Nezamzadeh, M., Rosen, H. J.,
Kramer, J. H., Gorno-Tempini, M. L., Miller,
B. L., & Weiner, M. W. (2011). Joint
assessment of structural, perfusion, and
diffusion MRI in Alzheimer's disease and
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
416
Volume 21, 2024
frontotemporal dementia. International
journal of Alzheimer's disease, 2011, 546871.
https://doi.org/10.4061/2011/546871.
[7] Escott-Price, V, Sims, R, Bannister, C,
Harold, D, Vronskaya, M, Majounie, E,
Badarinarayan, N, Morgan, K, Passmore, P,
Holmes, C, Powell, J, Brayne, C, Gill, M,
Mead, S, Goate, A, Cruchaga, C, Lambert, JC,
Duijn, C, Maier, W, Ramirez, A, Holmans, P,
Jones, L, Hardy, J, Seshadri, S, Schellenberg,
GD, Amouyel, P, Williams, J, Gerad, P &
Consortia, I 2015, Common polygenic
variation enhances risk prediction for
Alzheimer's disease, Brain, 138, pp. 3673-
3684.
https://doi.org/10.1093/brain/awv268.
[8] Lin, H., Himali, J. J., Satizabal, C. L., Beiser,
A. S., Levy, D., Benjamin, E. J., Gonzales, M.
M., Ghosh, S., Vasan, R. S., Seshadri, S., &
McGrath, E. R. (2022). Identifying Blood
Biomarkers for Dementia Using Machine
Learning Methods in the Framingham Heart
Study. Cells, 11(9), 1506.
https://doi.org/10.3390/cells11091506.
[9] Zhang, D., Wang, Y., Zhou, L., Yuan, H.,
Shen, D., & Alzheimer's Disease
Neuroimaging Initiative (2011). Multimodal
classification of Alzheimer's disease and mild
cognitive impairment. NeuroImage, 55(3),
856–867.
https://doi.org/10.1016/j.neuroimage.2011.01.
008.
[10] Franciotti, R., Nardini, D., Russo, M., Onofrj,
M., Sensi, S. L., Alzheimer's Disease
Neuroimaging Initiative, & Alzheimer's
Disease Metabolomics Consortium ADMC
(2023). Comparison of Machine Learning-
based Approaches to Predict the Conversion
to Alzheimer's Disease from Mild Cognitive
Impairment. Neuroscience, 514, 143–152.
https://doi.org/10.1016/j.neuroscience.2023.0
1.029.
[11] Klöppel, S., Stonnington, C. M., Chu, C.,
Draganski, B., Scahill, R. I., Rohrer, J. D.,
Fox, N. C., Jack, C. R., Jr, Ashburner, J., &
Frackowiak, R. S. (2008). Automatic
classification of MR scans in Alzheimer's
disease. Brain, 131(Pt 3), 681–689.
https://doi.org/10.1093/brain/awm319.
[12] Valcheva V., Georgieva O., (2023). Data
Classification Analysis for Alzheimer Disease
Diagnostic, 27th International Conference on
Circuits, Systems, Communications and
Computers (CSCC), Rhodes Island, Greece,
2023, 153-159.
IEEE.
https://doi.org/10.1109/CSCC58962.2023.000
32.
[13] Jack, C. R., Jr, Bennett, D. A., Blennow, K.,
Carrillo, M. C., Dunn, B., Haeberlein, S. B.,
Holtzman, D. M., Jagust, W., Jessen, F.,
Karlawish, J., Liu, E., Molinuevo, J. L.,
Montine, T., Phelps, C., Rankin, K. P., Rowe,
C. C., Scheltens, P., Siemers, E., Snyder, H.
M., Sperling, R., Contributors (2018).
NIA-AA Research Framework: Toward a
biological definition of Alzheimer's disease.
Alzheimer's & dementia: the journal of the
Alzheimer's Association, 14(4), 535–562
https://doi.org/10.1016/j.jalz.2018.02.018.
[14] Alzheimer’s Disease Neuroimaging Initiative,
[Online]. https://adni.loni.usc.edu (Accessed
Date: July 1, 2024).
[15] Calvin, C. M., de Boer, C., Raymont, V.,
Gallacher, J., Koychev, I., & European
Prevention of Alzheimer’s Dementia (EPAD)
Consortium (2020). Prediction of Alzheimer's
disease biomarker status defined by the 'ATN
framework' among cognitively healthy
individuals: results from the EPAD
longitudinal cohort study. Alzheimer's
research & therapy, 12(1), 143.
https://doi.org/10.1186/s13195-020-00711-5.
[16] Jack, C. R., Jr, Bennett, D. A., Blennow, K.,
Carrillo, M. C., Feldman, H. H., Frisoni, G.
B., Hampel, H., Jagust, W. J., Johnson, K. A.,
Knopman, D. S., Petersen, R. C., Scheltens,
P., Sperling, R. A., & Dubois, B. (2016).
A/T/N: An unbiased descriptive classification
scheme for Alzheimer disease biomarkers.
Neurology, 87(5), 539–547.
https://doi.org/10.1212/WNL.0000000000002
923.
[17] Chaves, M. L. F., Godinho, C. C., Porto, C.
S., Mansur, L., Carthery-Goulart, M. T.,
Yassuda, M. S., Beato, R., & Group
Recommendations in Alzheimer’s Disease
and Vascular Dementia of the Brazilian
Academy of Neurology (2011). Cognitive,
functional and behavioral assessment:
Alzheimer's disease. Dementia &
neuropsychologia, 5(3), 153–166.
https://doi.org/10.1590/S1980-
57642011DN05030003.
[18] ADNI3 Procedures Manual, Version 3.0,
Alzheimer's disease neuroimaging initiative 3:
Defining Alzheimer's disease, Keck School of
Medicine of USC, [Online].
https://adni.loni.usc.edu/wp-
content/uploads/2024/02/ADNI3_Procedures_
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
417
Volume 21, 2024
Manual_v3.0_29Feb2024.pdf, July 2024
(Accessed Date: July 5, 2024).
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors equally contributed in the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US
WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
DOI: 10.37394/23209.2024.21.38
Virginia Valcheva, Olga Georgieva
E-ISSN: 2224-3402
418
Volume 21, 2024