🔗 Web Application can be accessed at cvd-risk-prediction.streamlit.app

📃 Full-text article can be accessed here!


Project Overview

  • This project processed and analyzed 438,693 records from the CDC's 2021 BRFSS dataset. Machine Learning models were used to predict the risk of developing cardiovascular diseases (CVDs) using only personal lifestyle factors. The Logistic Regression model correctly identified 79.18% of the people with CVDs.

  • Developed a web application for this study using Streamlit. The web application visualizes the results of the study and allows users to interact with the model by inputting their personal lifestyle factors to determine their risk of developing cardiovascular diseases.

  • Presented the findings of this study at the Mathematical Society of the Philippines – National Capital Region (MSP-NCR) Annual Convention 2023. The results of this project were published in European-American Journals.

Abstract


Cardiovascular diseases (CVDs) have long been one of the leading causes of death globally. New technologies such as Machine Learning (ML) algorithms can help with the early detection and prevention of CVDs. This study focuses on using different ML models to determine a person's risk of developing CVDs from their personal lifestyle factors. The study extracted and processed 438,693 records from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention (CDC). The data was then partitioned into training and testing sets with a ratio of 0.8:0.2, so that the trained model could be evaluated on data it had not seen. One problem this study faced was the imbalance among the classes, which was addressed by using sampling techniques to balance the data for the ML models. The performance of the ML models was evaluated using stratified 10-fold cross-validation, and the best model was Logistic Regression (LR), with an F1 score of 0.32564. The LR model was then subjected to hyperparameter tuning and achieved its best score of 0.3257 with C = 0.1. Feature importance was also generated from the LR model, and the features with the most impact were Sex, Diabetes, and the General Health of an individual. The final LR model was then evaluated on the testing data and obtained an F1 score of 0.33. A confusion matrix was also used to better visualize the performance: the LR model correctly classified 79.18% of people with CVDs and 73.46% of healthy people. The ROC curve was also used as a performance metric, and the LR model obtained an AUC score of 0.837. The LR model can be used in the medical field and could be improved further by adding medical attributes to the data.
Overall, this study provides insight that can help in predicting the risk of CVDs using only the personal attributes of an individual.
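
The workflow described in the abstract (80/20 split, class balancing, stratified 10-fold cross-validation, tuning of C, then F1, confusion matrix, and AUC on the test set) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the study's actual code: the data is a synthetic stand-in for BRFSS, random undersampling is only one possible choice for the unspecified "sampling techniques", and the helper names `undersample` and `cv_f1` and the grid of C values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split


def undersample(X, y, rng):
    """Randomly drop rows so both classes end up the same size (assumed technique)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return X[keep], y[keep]


def cv_f1(X, y, C, n_splits=10, seed=0):
    """Mean F1 over stratified folds; only the training part of each fold is balanced."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        X_bal, y_bal = undersample(X[train_idx], y[train_idx], rng)
        model = LogisticRegression(C=C, max_iter=1000).fit(X_bal, y_bal)
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


# Synthetic, imbalanced stand-in data (roughly 10% positives); NOT the BRFSS records.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# 0.8:0.2 split, stratified so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hyperparameter tuning over a small, illustrative grid of C values.
cv_scores = {C: cv_f1(X_train, y_train, C) for C in (0.01, 0.1, 1.0, 10.0)}
best_C = max(cv_scores, key=cv_scores.get)

# Final model: balance the full training set, fit, then evaluate on untouched test data.
X_bal, y_bal = undersample(X_train, y_train, np.random.default_rng(0))
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_bal, y_bal)

y_pred = final_model.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])

# Row-normalized confusion matrix: the diagonal holds the per-class hit rates.
cm = confusion_matrix(y_test, y_pred, normalize="true")

print(f"best C: {best_C}, test F1: {test_f1:.3f}, AUC: {test_auc:.3f}")
print(f"positives correctly classified: {cm[1, 1]:.2%}")
print(f"negatives correctly classified: {cm[0, 0]:.2%}")
```

Balancing only the training portion of each fold keeps the validation and test data at their original class ratio. A model can then score high on per-class hit rates and AUC while its F1 score stays low, because a rare positive class drags precision down.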