Stroke prediction model based on XGBoost algorithm
WENWEN HE, HONGLI LE, PENGCHENG DU
School of Artificial Intelligence, Liangjiang
Chongqing University of Technology
Chongqing
CHINA
Abstract: In this paper, randomly sampled individual data are preprocessed: outlier values are deleted and the sample features are normalized to between 0 and 1. Correlation analysis is then used to determine and rank the relevance of stroke-related characteristics, and weakly correlated factors are discarded. The samples are randomly split into a 70% training set and a 30% testing set. Finally, a random forest model and the XGBoost algorithm, combined with cross-validation and grid search, are trained on the stroke characteristics. The testing-set accuracy of the XGBoost algorithm is 0.9264, which is better than that of the random forest model at 0.8991. The XGBoost model is therefore selected to predict stroke for ten people, with the conclusion that one person has a stroke and nine people do not.
Key-Words: Correlation coefficient; Stroke; Random forest model; XGBoost algorithm
Received: March 11, 2022. Revised: October 13, 2022. Accepted: November 9, 2022. Published: December 13, 2022.
1 Introduction
A stroke is a group of symptoms caused by blockage or bleeding of the blood vessels that supply the brain. Most strokes are ischemic: a blood clot blocks a cerebral blood vessel, or the vessel narrows, so that the blood supply to brain tissue is cut off and some brain cells suffer hypoxia or die [1]. This event, also known as cerebral infarction, results in death or disability. Both early prevention and timely treatment are therefore crucial. Much research on stroke is based on machine learning methods [2]. In this paper, a random forest model and the XGBoost algorithm are used to study stroke prevention and the relationship between stroke and individual characteristics.
2 Methods
2.1 Data preprocessing
The data set comprises 5,110 random samples with 12 attributes, including number, age, and gender; 4,861 of the sampled people have no stroke. The data set is therefore imbalanced, which this paper addresses with the SMOTE upsampling method. Additionally, decision tree regression on age, gender, and the available body mass index values is used to estimate the missing BMI values. Records whose gender is neither male nor female are removed as outliers. The OneHotEncoder module is used to encode character variables, and the features of numerical variables are then normalized to between 0 and 1.
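The encoding and normalization steps above can be sketched with scikit-learn; the column names and values below are illustrative toy records, not the paper's actual data set, and the BMI imputation step is omitted for brevity.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy records: a character column (gender) and two numeric columns (age, BMI).
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"]])
numeric = np.array([[67.0, 36.6], [61.0, 28.9], [80.0, 32.5], [49.0, 27.4]])

# Encode the character variable with OneHotEncoder.
encoder = OneHotEncoder()
gender_encoded = encoder.fit_transform(gender).toarray()

# Normalize numeric features to the range [0, 1] with min-max scaling.
scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(numeric)

# Combine the encoded and scaled columns into one feature matrix.
features = np.hstack([gender_encoded, numeric_scaled])
```

After these two steps every feature lies in [0, 1], which is the form the paper feeds to the learning algorithms.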
2.2 Feature selection
A scatter plot is drawn to examine the linear relationships in the data, as shown in Figure 1. As the figure demonstrates, the plot contains a large number of categorical dummy variables and rank variables that do not accurately represent the relationships between the variables. Moreover, quantitative columns such as "age", "average glucose content grade", and "body mass index" do not show a linear relationship with each other, so the correlation coefficient method is used to compute the correlations [3].
A two-sided test is run at the 95% and 99% confidence levels to determine whether each estimated coefficient falls into the rejection or acceptance domain. A coefficient for which the null hypothesis is rejected at the 95% confidence level (p < 0.05) is marked with "*", and one rejected at the 99% confidence level (p < 0.01) is marked with "**". The correlation coefficients between the individual characteristics and stroke are then obtained and shown in Table 3.
Table 3 The correlation coefficients of the features with stroke

Feature                   Coefficient   p-value
Gender                     0.009        0.516
Age                        0.250**      0.000
Hypertension               0.128**      0.000
Heart disease              0.135**      0.000
Married                   -0.108**      0.000
Work type                 -0.038**      0.007
Residence type            -0.015        0.269
Average glucose content    0.083**      0.000
Body mass index            0.055**      0.000
Smoking status            -0.067**      0.000
International Journal of Applied Sciences & Development
DOI: 10.37394/232029.2022.1.2
Wenwen He, Hongli Le, Pengcheng Du
E-ISSN: 2945-0454
7
Volume 1, 2022
In Table 3, p > 0.05 indicates no significant difference, 0.01 < p < 0.05 a significant difference, and p < 0.01 an extremely significant difference [4]. Age is the most significant positive correlation factor, followed by heart disease and hypertension. The average glucose content is also a major influencing factor, followed by body mass index; both are significantly positive correlation factors. Gender and residence type sit at the bottom of the positive correlation coefficients and show no significant correlation. Marital status, smoking status, and work type show significant negative correlations.
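The screening above can be sketched with scipy's `pearsonr`, which returns both the correlation coefficient and its two-sided p-value; the data below are synthetic and only mimic a weak positive age-stroke association, not the paper's samples.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, 500)
# Synthetic binary outcome whose probability loosely increases with age.
stroke = (age / 100 + rng.normal(0, 0.4, 500) > 0.9).astype(float)

# Pearson's r and its two-sided p-value for the feature-outcome pair.
r, p = pearsonr(age, stroke)

# Features whose p-value exceeds the chosen threshold would be discarded.
significant = p < 0.05
```

Running this test per feature and ranking by |r| reproduces the kind of table shown above: significant features are kept, the rest dropped.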
3 Solution to the Problem
After data preprocessing, the data set is randomly divided: 70% is used as a training set and 30% as a testing set. We study the binary classification problem of stroke disease, with two classes: stroke and no stroke. The data set is unbalanced, since 249 persons had strokes whereas 4,861 did not, a ratio of almost one to twenty. For this unbalanced data set, the SMOTE upsampling method is used to increase the number of minority stroke samples and obtain a relatively balanced training set. SMOTE is considered one of the most influential data sampling algorithms in machine learning and data mining [5]. In addition, the accuracy rate, recall rate, and classification report are used as performance indexes to evaluate the subsequent predictions.
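In practice the SMOTE step would be done with a library such as imbalanced-learn; the toy function below is only a minimal illustration of SMOTE's core idea, synthesizing a minority sample by linear interpolation between a minority point and one of its minority-class nearest neighbors.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x (excluding x itself).
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        # New point lies on the segment between x and the chosen neighbor.
        gap = rng.random()
        new_points.append(x + gap * (minority[j] - x))
    return np.array(new_points)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
```

Because each synthetic point lies between two real minority points, SMOTE enlarges the minority class without simply duplicating records.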
Cross-validation is a data resampling method used to assess the generalization ability of predictive models and to prevent overfitting [6]. The main idea is to divide the data into two groups, one for training and one for validation. The classifier is first trained on the training set and then tested on the validation set, whose score serves as the performance metric for evaluating the classifier.
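The procedure can be sketched with scikit-learn's `cross_val_score`; the data here are synthetic, and 5 folds are used instead of the paper's k = 10 only to keep the toy example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the stroke samples.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each fold trains on 4/5 of the data and validates on the remaining 1/5.
scores = cross_val_score(model, X, y, cv=5)
mean_score = scores.mean()
```

The mean of the per-fold scores is the single cross-validation score reported for each model in the comparison below.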
The random forest algorithm and XGBoost algorithm are modeled, and the sample data are grouped and trained with k-fold cross-validation, with k = 10. The training data are fed into the random forest algorithm and the extreme gradient boosting tree model for fitting. The comparison of the models' 10-fold cross-validation scores is given in Table 4.
Table 4 Scores of the random forest model and XGBoost model

Model           Mean CV score
Random forest   0.9592
XGBoost         0.9562
The average score of the random forest model is 0.9592 and that of the XGBoost model is 0.9562, so the two models have similar scores.
Table 5 f1 scores of the random forest model and XGBoost model

Model           f1 score
Random forest   0.1739
XGBoost         0.1754
The XGBoost model's f1 score of 0.1754 is slightly larger than the random forest model's 0.1739, so the XGBoost model performs a little better, but neither f1 score is good.
Tables 6 and 7 report the classification results of the random forest model and the XGBoost model, respectively.
Table 6 Classification result of the random forest model

              precision   recall   f1-score   support
0             0.94        0.94     0.94       1198
1             0.17        0.17     0.17       80
accuracy                           0.90       1278
macro avg     0.56        0.56     0.56       1278
weighted avg  0.90        0.90     0.90       1278

Accuracy score: 0.8953
Table 7 Classification result of the XGBoost model

              precision   recall   f1-score   support
0             0.94        0.98     0.96       1198
1             0.29        0.12     0.18       80
accuracy                           0.93       1278
macro avg     0.62        0.55     0.57       1278
weighted avg  0.90        0.93     0.91       1278

Accuracy score: 0.9264
A comparison of the accuracy scores shows that the XGBoost model scores higher than the random forest model.
3.1 Model optimization
The random forest model and XGBoost model under
K-fold cross-validation are used for stroke
prediction, and the prediction results on the test set are obtained in Table 8, which compares the accuracy of the two models on the training and test sets.
Figure 1: Scatter diagram. Figure 2: ROC curves.
Table 8 Accuracy of the random forest model and XGBoost model

                        Random forest   XGBoost
Training set accuracy   1.0             1.0
Test set accuracy       0.8960          0.9264
Table 8 shows that the training-set accuracy of both models is 1, but the test-set accuracy of the random forest model is lower than that of the XGBoost model.
To assess the classification performance of the models, the ROC curves in Figure 2 are used to visualize the prediction results. The figure shows that the AUC value of the random forest model is 0.025 larger than that of the XGBoost model, and the AUC values of both models lie between 0.7 and 0.85, indicating that the prediction performance is only average.
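An ROC curve and its AUC can be computed from predicted probabilities as sketched below; the true labels and scores are synthetic, standing in for the models' test-set outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic ground truth (0 = no stroke, 1 = stroke) and predicted scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.8, 0.9])

# Points of the ROC curve: false-positive rate vs true-positive rate
# at every distinct score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve; 0.5 is chance level, 1.0 is perfect ranking.
auc = roc_auc_score(y_true, y_score)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) yields the kind of curves compared in Figure 2.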
Grid search is the traditional method of hyperparameter optimization: it simply performs a complete search over a given subset of the training algorithm's hyperparameter space [7]. Here, grid search is used to find the optimal parameters of the stroke prediction models based on the random forest and XGBoost algorithms. The resulting optimal parameters are shown in Table 9 for the random forest model and Table 10 for the XGBoost model, respectively.
Table 9 Optimal parameters of the random forest model

Parameter      Value
bootstrap      False
max_features   2
n_estimators   200
Table 10 Optimal parameters of the XGBoost model

Parameter          Value
booster            gbtree
learning_rate      0.1
max_depth          5
min_child_weight   1
subsample          0.8
colsample_bytree   1
reg_lambda         1
n_estimators       100000
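A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The parameter names below match the random forest grid of Table 9, but the data are synthetic and the candidate values are shrunk so the toy run stays fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed stroke training set.
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# Candidate values for each hyperparameter (cf. Table 9).
param_grid = {
    "bootstrap": [True, False],
    "max_features": [2, 3],
    "n_estimators": [20, 50],
}

# Exhaustively evaluate every combination with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_  # the combination with the highest CV score
```

The same pattern applies to the XGBoost grid of Table 10, since the XGBoost classifier also exposes the scikit-learn fit interface.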
The models were rebuilt with these optimal parameters, and the following classification reports were obtained.
Table 11 Classification result of the random forest model

              precision   recall   f1-score   support
0             0.94        0.95     0.95       1198
1             0.15        0.14     0.15       80
accuracy                           0.90       1278
macro avg     0.55        0.54     0.55       1278
weighted avg  0.89        0.90     0.90       1278

Accuracy score: 0.8991
Table 12 Classification result of the XGBoost model

              precision   recall   f1-score   support
0             0.94        0.98     0.96       1198
1             0.29        0.12     0.18       80
accuracy                           0.93       1278
macro avg     0.62        0.55     0.57       1278
weighted avg  0.90        0.93     0.91       1278

Accuracy score: 0.9264
After training and optimizing both models and comparing their accuracy, we finally choose the XGBoost model for its higher score.
3.2 Realization of stroke prediction
The XGBoost stroke prediction model is selected to predict strokes for a new data set of 10 individuals, with the same 12 characteristics (number, age, gender, and so on). The data set is preprocessed and standardized in the same way as the training data and then input to the model, and the following predictions are obtained.
Table 13 Prediction results

Id   Stroke
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    0
9    0
The results show that one of the ten individuals is predicted to have a stroke: a 59-year-old unmarried female living in an urban area, without hypertension or heart disease, with a private-sector job, a mean blood glucose of 92.42, a BMI of 23.6, and a non-smoker.
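This prediction step can be sketched as below; a scikit-learn gradient-boosting classifier stands in for the paper's XGBoost model (which exposes the same fit/predict interface), and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic training data standing in for the preprocessed stroke samples.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)

# Ten new individuals, assumed to be preprocessed with the same
# encoder/scaler that was fitted on the training data.
X_new, _ = make_classification(n_samples=10, n_features=10, random_state=1)
pred = model.predict(X_new)  # 1 = stroke, 0 = no stroke
```

The key practical point is that the new samples must pass through exactly the same preprocessing pipeline as the training set before `predict` is called.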
4 Conclusion
Stroke is harmful to humans and should be prevented early. To prevent stroke, one can strengthen physical exercise, enhance physical fitness, improve disease resistance, delay aging, and undergo regular medical check-ups and disease screening.
In this paper, the random forest algorithm and the XGBoost algorithm are applied to learning and predicting stroke. Based on the f1-score and accuracy, the XGBoost model is selected.
References:
[1] Lo, E., Dalkara, T. & Moskowitz, M.
Mechanisms, challenges and opportunities in
stroke. Nat Rev Neurosci 4, 399–414 (2003).
[2] Dritsas E, Trigka M. Stroke risk prediction with
machine learning techniques[J]. Sensors, 2022,
22(13): 4670.
[3] Asuero A G, Sayago A, González A G. The
correlation coefficient: An overview[J]. Critical
reviews in analytical chemistry, 2006, 36(1): 41-
59.
[4] Zain M, Ibrahim M. The significance of P-value
in medical research[J]. Journal of Allied Health
Sciences, 2015, 1(1): 74-85.
[5] Fernández A, Garcia S, Herrera F, et al. SMOTE
for learning from imbalanced data: progress and
challenges, marking the 15-year anniversary[J].
Journal of artificial intelligence research, 2018,
61: 863-905.
[6] T. Hastie, R. Tibshirani, J. Friedman, The
Elements of Statistical Learning, 2nd edition,
Springer, New York/Berlin/Heidelberg, 2008.
[7] Liashchynskyi P, Liashchynskyi P. Grid search,
random search, genetic algorithm: a big
comparison for NAS[J]. arXiv preprint
arXiv:1912.06059, 2019.
Contribution of individual authors to
the creation of a scientific article
Wenwen He: Conceptualization, Methodology,
Software, Validation, Writing- review & editing.
Pengcheng Du: Conceptualization, Software.
Hongli Le: Writing- original draft, Visualization,
Supervision, Data curation, Formal analysis.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US