Stroke Prediction Analysis

May 8, 2021

Stroke is a leading cause of death in the United States. In this project, I explore how certain characteristics (age, BMI, average glucose level, occupation, etc.) correlate with the likelihood of having a stroke.

Reading and Exploring the Dataset

The target variable of this dataset is called “Stroke”, with 1 = had a stroke and 0 = no stroke. The categorical variables include gender, hypertension (1 = yes, 0 = no), heart_disease, ever_married, work_type, Residence_type, smoking_status, and Age_Group (a separate category I created to group individuals of similar ages into bins). The continuous variables are age, avg_glucose_level, and BMI.
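A minimal sketch of how an Age_Group column like the one described above could be built with pandas. The sample rows, the ten-year bin edges, and the label format are assumptions for illustration; the bins actually used in this project are not stated.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
df = pd.DataFrame({
    "age": [5, 23, 37, 45, 61, 79],
    "stroke": [0, 0, 0, 1, 0, 1],
})

# Assumed bin edges (one bin per decade).
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes each bin include its lower edge, e.g. [30, 40) -> "30-40".
df["Age_Group"] = pd.cut(df["age"], bins=edges, labels=labels, right=False)

# Share of stroke cases in each age group, the kind of quantity a bar chart can show.
stroke_rate = df.groupby("Age_Group", observed=True)["stroke"].mean()
print(stroke_rate)
```

Calling `stroke_rate.plot(kind="bar")` would turn this table into a bar chart, assuming matplotlib is installed.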

To begin, I created a bar chart to explore whether there is any correlation between age group and the likelihood of having a stroke. The results are as follows:

From this chart, we can see that starting from the 30–40 age group, strokes become more frequent as age increases. To further my investigation, I created a heat map of the different variables to look for any correlations between them.

The correlation matrix shows no strong correlation between any pair of features. The highest correlations are between age and marriage status and between BMI and marriage status. The features that appear most relevant to strokes are age, hypertension, heart disease, and average glucose level.
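A correlation matrix like this one can be sketched with pandas as follows. The sample rows are made up, and encoding ever_married as 0/1 is an assumption about how a categorical column could be included; the project's actual encoding is not described.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the dataset description.
df = pd.DataFrame({
    "age": [25, 47, 63, 35, 71, 52],
    "ever_married": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
    "hypertension": [0, 0, 1, 0, 1, 0],
    "stroke": [0, 0, 1, 0, 1, 0],
})

# Assumed encoding: turn the Yes/No column into 1/0 so it can be correlated.
encoded = df.assign(ever_married=(df["ever_married"] == "Yes").astype(int))

# Pairwise Pearson correlations between all numeric columns.
corr = encoded.corr()

# Correlation of each feature with the target, strongest first.
stroke_corr = corr["stroke"].drop("stroke").sort_values(ascending=False)
print(stroke_corr)
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would render it as a heat map, assuming seaborn is available.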

Modeling

After exploring the data and noting some of the observed relationships, I split the data into training and testing sets, with a test size of 0.2 and the random state set to 30.
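The split described above might look like the following with scikit-learn's `train_test_split`. The feature matrix and labels here are synthetic placeholders; only `test_size=0.2` and `random_state=30` come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 3 features, roughly 10% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30
)

print(X_train.shape, X_test.shape)
```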

Findings from testing out different models are as follows:

1. Logistic Regression

Accuracy score of 0.9657534246575342

2. Decision Tree

Accuracy score of 0.9197651663405088

3. Random Forest Model

Accuracy score of 0.9608610567514677. I also ranked the features by their importance in the random forest model. The top 3 were 1) Average Glucose Level, 2) BMI, and 3) Age.
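The comparison above can be sketched with scikit-learn as follows. This is an illustration on synthetic data with an invented labeling rule, not the project's code; the hyperparameters (`max_iter`, `n_estimators`) are assumptions, and the scores it prints will not match the ones reported here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three continuous features.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.uniform(1, 90, n),
    "avg_glucose_level": rng.uniform(55, 270, n),
    "bmi": rng.uniform(15, 45, n),
})
# Invented labeling rule, used only so the models have something to learn.
y = ((X["age"] > 60) & (X["avg_glucose_level"] > 150)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=30),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=30),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Rank features by the random forest's impurity-based importances.
importances = pd.Series(
    models["Random Forest"].feature_importances_, index=X.columns
).sort_values(ascending=False)
top_features = importances.head(3)

print(scores)
print(top_features)
```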

Using cross-validation and L1 and L2 regularization, I was able to simplify the model. The results are as follows:
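One way cross-validation can be combined with L1 and L2 regularization is sketched below, assuming the model in question is logistic regression. The synthetic data, the regularization strength `C=0.1`, the `liblinear` solver, the feature scaling step, and the 5-fold setup are all assumptions rather than details from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=30
)

results = {}
for penalty in ("l1", "l2"):
    # Scale features first so the penalty treats all coefficients comparably.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, C=0.1, solver="liblinear"),
    )
    # Mean accuracy over 5 cross-validation folds.
    results[penalty] = cross_val_score(
        model, X, y, cv=5, scoring="accuracy"
    ).mean()

print(results)
```

An L1 penalty can push some coefficients to exactly zero, which is one sense in which regularization simplifies a model; an L2 penalty shrinks coefficients without removing them.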

The most accurate model out of the ones tested was Logistic Regression, with an accuracy score of 0.9657534.

To sum up, this analysis revealed interesting results on how strokes can be predicted. Because stroke causes 1 in 20 deaths in Americans every year, using the available tools to explore how these features relate to stroke risk can bring important insights to us all.

Written by Cynthia Yang