
Stroke prediction model based on XGBoost algorithm
WENWEN HE, HONGLI LE, PENGCHENG DU
School of Artificial Intelligence, Liangjiang
Chongqing University of Technology
Chongqing
CHINA
Abstract: In this paper, randomly sampled individual data are preprocessed: outlier values are deleted
and the sample features are normalized to between 0 and 1. Correlation analysis is then used to determine
and rank the relevance of stroke characteristics, and weakly correlated factors are discarded. The samples
are randomly split into a 70% training set and a 30% testing set. Finally, the random forest model and the
XGBoost algorithm, combined with cross-validation and grid search, are used to learn the stroke
characteristics. The XGBoost algorithm reaches an accuracy of 0.9257 on the testing set, which is better
than the 0.8991 of the random forest model. The XGBoost model is therefore selected to predict stroke for
ten people, and it predicts that two of them have a stroke and eight do not.
Key-Words: Correlation coefficient; Stroke; Random forest model; XGBoost algorithm
Received: March 11, 2022. Revised: October 13, 2022. Accepted: November 9, 2022. Published: December 13, 2022.
1 Introduction
A stroke is a group of symptoms caused by blockage
or bleeding in the arteries that supply the brain.
Most strokes are ischemic: a blood clot blocks a
cerebral blood vessel, or the vessel narrows, so that
the blood supply to the brain tissue is cut off, which
causes hypoxia or death of some brain cells [1].
This event, also known as cerebral infarction, results
in death or impaired function. Both early prevention
and timely treatment are crucial. Much of the research
on stroke is based on machine learning methods [2].
In this paper, the random forest model and the
XGBoost algorithm are used to study stroke
prevention and to understand the relationship between
stroke and individual characteristics.
2 Methods
2.1 Data preprocessing
The data set contains 5,110 random samples, of which
4,861 are people without a stroke. Each sample has 12
attributes, including number, age, and gender. The
data set is therefore imbalanced, which is addressed
in this paper with the SMOTE upsampling method.
Additionally, decision tree regression on age, sex,
and the available body mass index (BMI) values is
used to estimate the missing BMI values. Samples
whose gender is neither male nor female are treated
as outliers and removed. The OneHotEncoder module
is used to encode character variables, and the
numerical features are then normalized to between
0 and 1.
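
The class counts, the normalization to between 0 and 1, and the idea behind SMOTE can be illustrated with a minimal sketch. It uses only the Python standard library and is not the implementation used in the paper: the helper names `min_max_scale` and `smote_like` are hypothetical, the toy age and BMI values are invented for illustration, and the sketch only shows the core interpolation step of SMOTE rather than a full library version.

```python
import random

def min_max_scale(column):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in column]

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Class counts stated in the text: 5110 samples, 4861 without a stroke.
n_total, n_negative = 5110, 4861
n_positive = n_total - n_negative   # 249 stroke cases
n_needed = n_negative - n_positive  # 4612 synthetic cases would balance the classes

# Hypothetical toy values (not taken from the data set): age and BMI of stroke cases.
ages = [45, 60, 67, 72, 80, 81]
bmis = [24.0, 28.5, 36.6, 30.1, 27.3, 29.0]
minority = [list(row) for row in zip(min_max_scale(ages), min_max_scale(bmis))]
new_points = smote_like(minority, n_new=4, k=3)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the normalized range. In practice the upsampling would normally be applied to the training data only, so that the testing set contains no synthetic samples; the paper does not state which order was used.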
2.2 Feature selection
A scatter plot is drawn to examine the linear
relationship between the data, as shown in Figure 1.
The scatter plot contains a large number of
categorical dummy variables and rank variables,
which do not accurately represent the relationships
between the variables.
Moreover, quantitative data columns such as "age",
"average glucose content" and "body mass index" do
not show a linear relationship with each other, so
the correlation coefficient method is used to
compute the correlations [3].
A two-sided test is run at the 90% and 95%
confidence levels. The test determines whether the
estimated value falls into the rejection region or
the acceptance region. If the null hypothesis is
rejected at the 90% confidence level (p-value below
0.10), the coefficient is marked with "*". If it is
rejected at the 95% confidence level (p-value below
0.05), the coefficient is marked with "**". The
correlation coefficients between the individual
characteristics and stroke are then obtained and
shown in Table 3.
Table 3 The correlation coefficient of the features

Feature                    Stroke      Significance (two-sided)
Gender                      0.009      0.516
Age                         0.250**    0.000
Hypertension                0.128**    0.000
Heart disease               0.135**    0.000
Married                    -0.108**    0.000
Work Type                  -0.038**    0.007
Type Residence             -0.015      0.269
Average glucose content     0.083**    0.000
Body mass index             0.055**    0.000
Smoking_status             -0.067**    0.000
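
The link between a correlation coefficient, its two-sided significance, and the "*" and "**" markers can be sketched as follows. The paper does not name the coefficient it uses, so the Pearson coefficient here is an assumption, as are the function names `pearson_r`, `two_sided_p` and `marker`. The p-value uses the statistic t = r * sqrt((n - 2) / (1 - r^2)) with a normal approximation to the t distribution, which is only reasonable for large samples such as the roughly 5,110 used here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def two_sided_p(r, n):
    """Approximate two-sided p-value for the null hypothesis of no correlation.
    Assumption: n is large, so the t statistic is treated as standard normal."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

def marker(p):
    """Significance marker: '**' at the 95% level, '*' at the 90% level."""
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

# Coefficients taken from Table 3; the sample size n = 5110 is an assumption,
# since outlier removal and upsampling may have changed it.
p_work_type = two_sided_p(-0.038, 5110)
p_gender = two_sided_p(0.009, 5110)
```

Under these assumptions, `p_work_type` comes out near 0.007 and is marked "**", while `p_gender` is above 0.5 and receives no marker, which is in line with the Work Type and Gender rows of Table 3.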
International Journal of Applied Sciences & Development
DOI: 10.37394/232029.2022.1.2
Wenwen He, Hongli Le, Pengcheng Du