Stroke prediction model based on XGBoost algorithm
WENWEN HE, HONGLI LE, PENGCHENG DU
School of Artificial Intelligence, Liangjiang
Chongqing University of Technology
Chongqing
CHINA
Abstract: In this paper, randomly sampled individual data are preprocessed: outlier values are deleted and the sample features are normalized to between 0 and 1. Correlation analysis is then used to determine and rank the relevance of stroke-related characteristics, and weakly correlated factors are discarded. The samples are randomly split into a 70% training set and a 30% testing set. Finally, a random forest model and the XGBoost algorithm, combined with cross-validation and grid search, are trained on the stroke characteristics. The testing-set accuracy of the XGBoost algorithm is 0.9264, which is better than that of the random forest model at 0.8991. The XGBoost model is therefore selected to predict stroke for ten people, with the conclusion that one person has a stroke and nine people do not.
Key-Words: Correlation coefficient; Stroke; Random forest model; XGBoost algorithm
Received: March 11, 2022. Revised: October 13, 2022. Accepted: November 9, 2022. Published: December 13, 2022.
1 Introduction
A stroke is a group of symptoms caused by blockage or bleeding of the blood vessels that supply the brain. Most strokes are ischemic: a blood clot blocks a cerebral blood vessel, or the vessel narrows, so that the blood supply to brain tissue is cut off and some brain cells suffer hypoxia or die [1]. This event, also known as cerebral infarction, results in death or disability. Both early prevention and timely treatment are therefore crucial. Much research on stroke is based on machine learning methods [2]. In this paper, a random forest model and the XGBoost algorithm are used to study stroke prevention and the relationship between stroke and individual characteristics.
2 Methods
2.1 Data preprocessing
The data set comprises 5,110 random samples with 12 attributes, including number, age, and gender; 4,861 of the sampled people have no stroke. The data set is therefore imbalanced, which this paper addresses with the SMOTE upsampling method. Additionally, decision tree regression on age, gender, and the available body mass index values is used to estimate the missing BMI values. Records whose gender is neither male nor female are removed as outliers. The OneHotEncoder module is used to encode character variables, and the features of numerical variables are then normalized to between 0 and 1.
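The encoding and normalization steps above can be sketched with scikit-learn; the column names and values below are illustrative toy records, not the paper's actual data set, and the BMI imputation step is omitted for brevity.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy records: a character column (gender) and two numeric columns (age, BMI).
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"]])
numeric = np.array([[67.0, 36.6], [61.0, 28.9], [80.0, 32.5], [49.0, 27.4]])

# Encode the character variable with OneHotEncoder.
encoder = OneHotEncoder()
gender_encoded = encoder.fit_transform(gender).toarray()

# Normalize numeric features to the range [0, 1] with min-max scaling.
scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(numeric)

# Combine the encoded and scaled columns into one feature matrix.
features = np.hstack([gender_encoded, numeric_scaled])
```

After these two steps every feature lies in [0, 1], which is the form the paper feeds to the learning algorithms.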
2.2 Feature selection
A scatter plot is drawn to examine the linear relationships in the data, as shown in Figure 1. As the figure demonstrates, the plot contains a large number of categorical dummy variables and rank variables that do not accurately represent the relationships between the variables. Moreover, quantitative columns such as "age", "average glucose content grade", and "body mass index" do not show a linear relationship with each other, so the correlation coefficient method is used to compute the correlations [3].
A two-sided test is run at the 95% and 99% confidence levels to determine whether each estimated coefficient falls into the rejection or acceptance domain. A coefficient for which the null hypothesis is rejected at the 95% confidence level (p < 0.05) is marked with "*", and one rejected at the 99% confidence level (p < 0.01) is marked with "**". The correlation coefficients between the individual characteristics and stroke are then obtained and shown in Table 3.
Table 3 The correlation coefficients of the features with stroke

Feature                   Coefficient   p-value
Gender                     0.009        0.516
Age                        0.250**      0.000
Hypertension               0.128**      0.000
Heart disease              0.135**      0.000
Married                   -0.108**      0.000
Work type                 -0.038**      0.007
Residence type            -0.015        0.269
Average glucose content    0.083**      0.000
Body mass index            0.055**      0.000
Smoking status            -0.067**      0.000
International Journal of Applied Sciences & Development
DOI: 10.37394/232029.2022.1.2
Wenwen He, Hongli Le, Pengcheng Du
E-ISSN: 2945-0454
7
Volume 1, 2022
In Table 3, p > 0.05 indicates no significant difference, 0.01 < p < 0.05 a significant difference, and p < 0.01 an extremely significant difference [4]. Age is the most significant positive correlation factor, followed by heart disease and hypertension. The average glucose content is also a major influencing factor, followed by body mass index; both are significantly positive correlation factors. Gender and residence type sit at the bottom of the positive correlation coefficients and show no significant correlation. Marital status, smoking status, and work type show significant negative correlations.
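The screening above can be sketched with scipy's `pearsonr`, which returns both the correlation coefficient and its two-sided p-value; the data below are synthetic and only mimic a weak positive age-stroke association, not the paper's samples.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, 500)
# Synthetic binary outcome whose probability loosely increases with age.
stroke = (age / 100 + rng.normal(0, 0.4, 500) > 0.9).astype(float)

# Pearson's r and its two-sided p-value for the feature-outcome pair.
r, p = pearsonr(age, stroke)

# Features whose p-value exceeds the chosen threshold would be discarded.
significant = p < 0.05
```

Running this test per feature and ranking by |r| reproduces the kind of table shown above: significant features are kept, the rest dropped.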
3 Solution to the Problem
After data preprocessing, the data set is randomly divided: 70% is used as a training set and 30% as a testing set. We study the binary classification problem of stroke disease, with two classes: stroke and no stroke. The data set is unbalanced, since 249 persons had strokes whereas 4,861 did not, a ratio of almost one to twenty. For this unbalanced data set, the SMOTE upsampling method is used to increase the number of minority stroke samples and obtain a relatively balanced training set. SMOTE is considered one of the most influential data sampling algorithms in machine learning and data mining [5]. In addition, the accuracy rate, recall rate, and classification report are used as performance indexes to evaluate the subsequent predictions.
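In practice the SMOTE step would be done with a library such as imbalanced-learn; the toy function below is only a minimal illustration of SMOTE's core idea, synthesizing a minority sample by linear interpolation between a minority point and one of its minority-class nearest neighbors.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x (excluding x itself).
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        # New point lies on the segment between x and the chosen neighbor.
        gap = rng.random()
        new_points.append(x + gap * (minority[j] - x))
    return np.array(new_points)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
```

Because each synthetic point lies between two real minority points, SMOTE enlarges the minority class without simply duplicating records.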
Cross-validation is a data resampling method used to assess the generalization ability of predictive models and to prevent overfitting [6]. The main idea is to divide the data into two groups, one for training and one for validation. The classifier is first trained on the training set and then tested on the validation set, whose score serves as the performance metric for evaluating the classifier.
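The procedure can be sketched with scikit-learn's `cross_val_score`; the data here are synthetic, and 5 folds are used instead of the paper's k = 10 only to keep the toy example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the stroke samples.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each fold trains on 4/5 of the data and validates on the remaining 1/5.
scores = cross_val_score(model, X, y, cv=5)
mean_score = scores.mean()
```

The mean of the per-fold scores is the single cross-validation score reported for each model in the comparison below.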
The random forest algorithm and XGBoost algorithm are modeled, and the sample data are grouped and trained with k-fold cross-validation, with k = 10. The training data are fed into the random forest algorithm and the extreme gradient boosting tree model for fitting. The comparison of the models' 10-fold cross-validation scores is given in Table 4.
Table 4 Scores of the random forest model and XGBoost model

Model           Mean CV score
Random forest   0.9592
XGBoost         0.9562
The average score of the random forest model is 0.9592 and that of the XGBoost model is 0.9562, so the two models have similar scores.
Table 5 f1 scores of the random forest model and XGBoost model

Model           f1 score
Random forest   0.1739
XGBoost         0.1754
The XGBoost model's f1 score of 0.1754 is slightly larger than the random forest model's 0.1739, so the XGBoost model performs a little better, but neither f1 score is good.
Tables 6 and 7 report the classification results of the random forest model and the XGBoost model, respectively.
Table 6 Classification result of the random forest model

              precision   recall   f1-score   support
0             0.94        0.94     0.94       1198
1             0.17        0.17     0.17       80
accuracy                           0.90       1278
macro avg     0.56        0.56     0.56       1278
weighted avg  0.90        0.90     0.90       1278

Accuracy score: 0.8953
Table 7 Classification result of the XGBoost model

              precision   recall   f1-score   support
0             0.94        0.98     0.96       1198
1             0.29        0.12     0.18       80
accuracy                           0.93       1278
macro avg     0.62        0.55     0.57       1278
weighted avg  0.90        0.93     0.91       1278

Accuracy score: 0.9264
A comparison of the accuracy scores shows that the XGBoost model scores higher than the random forest model.
3.1 Model optimization
The random forest model and XGBoost model under
K-fold cross-validation are used for stroke
prediction, and the prediction results on the test set are obtained in Table 8, which compares the accuracy of the two models on the training and test sets.
Figure 1: Scatter diagram. Figure 2: ROC curves.
Table 8 Accuracy of the random forest model and XGBoost model

                        Random forest   XGBoost
Training set accuracy   1.0             1.0
Test set accuracy       0.8960          0.9264
Table 8 shows that the training-set accuracy of both models is 1, but the test-set accuracy of the random forest model is lower than that of the XGBoost model.
To assess the classification performance of the models, the ROC curves in Figure 2 are used to visualize the prediction results. The figure shows that the AUC value of the random forest model is 0.025 larger than that of the XGBoost model, and the AUC values of both models lie between 0.7 and 0.85, indicating that the prediction performance is only average.
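An ROC curve and its AUC can be computed from predicted probabilities as sketched below; the true labels and scores are synthetic, standing in for the models' test-set outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic ground truth (0 = no stroke, 1 = stroke) and predicted scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.8, 0.9])

# Points of the ROC curve: false-positive rate vs true-positive rate
# at every distinct score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve; 0.5 is chance level, 1.0 is perfect ranking.
auc = roc_auc_score(y_true, y_score)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) yields the kind of curves compared in Figure 2.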
Grid search is the traditional method of hyperparameter optimization: it simply performs a complete search over a given subset of the training algorithm's hyperparameter space [7]. Here, grid search is used to find the optimal parameters of the stroke prediction models based on the random forest and XGBoost algorithms. The resulting optimal parameters are shown in Table 9 for the random forest model and Table 10 for the XGBoost model, respectively.
Table 9 Optimal parameters of the random forest model

Parameter      Value
bootstrap      False
max_features   2
n_estimators   200
Table 10 Optimal parameters of the XGBoost model

Parameter          Value
booster            gbtree
learning_rate      0.1
max_depth          5
min_child_weight   1
subsample          0.8
colsample_bytree   1
reg_lambda         1
n_estimators       100000
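A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The parameter names below match the random forest grid of Table 9, but the data are synthetic and the candidate values are shrunk so the toy run stays fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed stroke training set.
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# Candidate values for each hyperparameter (cf. Table 9).
param_grid = {
    "bootstrap": [True, False],
    "max_features": [2, 3],
    "n_estimators": [20, 50],
}

# Exhaustively evaluate every combination with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_  # the combination with the highest CV score
```

The same pattern applies to the XGBoost grid of Table 10, since the XGBoost classifier also exposes the scikit-learn fit interface.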
The models were rebuilt with these optimal parameters, and the following classification reports were obtained.
Table 11 Classification result of the random forest model

              precision   recall   f1-score   support
0             0.94        0.95     0.95       1198
1             0.15        0.14     0.15       80
accuracy                           0.90       1278
macro avg     0.55        0.54     0.55       1278
weighted avg  0.89        0.90     0.90       1278

Accuracy score: 0.8991
Table 12 Classification result of the XGBoost model

              precision   recall   f1-score   support
0             0.94        0.98     0.96       1198
1             0.29        0.12     0.18       80
accuracy                           0.93       1278
macro avg     0.62        0.55     0.57       1278
weighted avg  0.90        0.93     0.91       1278

Accuracy score: 0.9264
After training and optimizing both models and comparing their accuracy, we finally choose the XGBoost model for its higher score.
3.2 Realization of stroke prediction
The XGBoost stroke prediction model is selected to predict strokes for a new data set of 10 individuals, with the same 12 characteristics (number, age, gender, and so on). The data set is preprocessed and standardized in the same way as the training data and then input to the model, and the following predictions are obtained.
Table 13 Prediction results

Id   Stroke
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    0
9    0
The results show that one of the ten individuals is predicted to have a stroke: a 59-year-old unmarried female living in an urban area, without hypertension or heart disease, with a private-sector job, a mean blood glucose of 92.42, a BMI of 23.6, and a non-smoker.
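This prediction step can be sketched as below; a scikit-learn gradient-boosting classifier stands in for the paper's XGBoost model (which exposes the same fit/predict interface), and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic training data standing in for the preprocessed stroke samples.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)

# Ten new individuals, assumed to be preprocessed with the same
# encoder/scaler that was fitted on the training data.
X_new, _ = make_classification(n_samples=10, n_features=10, random_state=1)
pred = model.predict(X_new)  # 1 = stroke, 0 = no stroke
```

The key practical point is that the new samples must pass through exactly the same preprocessing pipeline as the training set before `predict` is called.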
4 Conclusion
Stroke is harmful to humans and should be prevented early. To prevent stroke, one can strengthen physical exercise, enhance physical fitness, improve disease resistance, delay aging, and undergo regular medical check-ups and disease screening.
In this paper, the random forest algorithm and the XGBoost algorithm are applied to learning and predicting stroke. Based on the f1-score and accuracy, the XGBoost model is selected.
References:
[1] Lo, E., Dalkara, T. & Moskowitz, M.
Mechanisms, challenges and opportunities in
stroke. Nat Rev Neurosci 4, 399–414 (2003).
[2] Dritsas E, Trigka M. Stroke risk prediction with
machine learning techniques[J]. Sensors, 2022,
22(13): 4670.
[3] Asuero A G, Sayago A, González A G. The
correlation coefficient: An overview[J]. Critical
reviews in analytical chemistry, 2006, 36(1): 41-
59.
[4] Zain M, Ibrahim M. The significance of P-value
in medical research[J]. Journal of Allied Health
Sciences, 2015, 1(1): 74-85.
[5] Fernández A, Garcia S, Herrera F, et al. SMOTE
for learning from imbalanced data: progress and
challenges, marking the 15-year anniversary[J].
Journal of artificial intelligence research, 2018,
61: 863-905.
[6] T. Hastie, R. Tibshirani, J. Friedman, The
Elements of Statistical Learning, 2nd edition,
Springer, New York/Berlin/Heidelberg, 2008.
[7] Liashchynskyi P, Liashchynskyi P. Grid search,
random search, genetic algorithm: a big
comparison for NAS[J]. arXiv preprint
arXiv:1912.06059, 2019.
Contribution of individual authors to
the creation of a scientific article
Wenwen He: Conceptualization, Methodology,
Software, Validation, Writing- review & editing.
Pengcheng Du: Conceptualization, Software.
Hongli Le: Writing- original draft, Visualization,
Supervision, Data curation, Formal analysis.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US