Machine Learning for Stroke Prediction: Evaluating the Effectiveness of Data Balancing Approaches

Muhamad Indra; Siti Ernawati; Ilham Maulana

doi:10.34288/jri.v6i4.344

Authors

Muhamad Indra Universitas Nusa Mandiri
Siti Ernawati
Ilham Maulana

(*) Corresponding Author


Keywords:

Stroke Prediction, Data Balancing Technique, Artificial Intelligence, Classification, Imbalanced Data

Abstract

Stroke occurs when blood flow to the brain is disrupted, either by a blood clot (ischemic) or by a ruptured blood vessel (hemorrhagic), leading to brain tissue damage and neurological dysfunction. It remains a leading cause of death and disability worldwide, which makes early prediction crucial for timely intervention. This study evaluates the impact of data balancing techniques on stroke prediction performance across different machine learning models. Random Forest (RF) consistently achieves the highest accuracy (98%), but its precision and recall vary with the balancing method. Decision Tree (DT) and K-Nearest Neighbors (KNN) benefit most from SMOTE and SMOTETomek, which improve their F1-scores (11.21% and 9.18%), indicating a better balance between precision and recall. Random Under Sampling raises recall for all models but lowers precision, reducing overall predictive reliability. SMOTE and SMOTETomek emerge as the most effective balancing techniques, particularly for DT and KNN, while RF remains the most accurate model but requires further optimization to balance precision and recall.
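The gap between high accuracy and weak precision, recall, and F1 on imbalanced data can be made concrete with a short sketch. The Python below is an illustration only, not the authors' pipeline: the function names `classification_metrics` and `random_undersample` are hypothetical, the confusion-matrix counts are invented for the example and are not results from the study, and the undersampling shown is the plain "shrink every class to the size of the smallest one" variant.

```python
import random
from collections import defaultdict


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def random_undersample(features, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    smallest = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, smallest):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Invented counts (assumption): 1000 test cases, 50 strokes, only 5 detected.
example = classification_metrics(tp=5, fp=5, fn=45, tn=945)

# Invented toy data (assumption): 5 positive cases among 100 samples.
toy_x = list(range(100))
toy_y = [1] * 5 + [0] * 95
balanced_x, balanced_y = random_undersample(toy_x, toy_y)
```

With these invented counts the accuracy is 95% while recall is only 10%, which is why F1 is the more informative metric when stroke cases are rare. The undersampling helper discards majority-class samples at random, so it balances the classes at the cost of throwing away data.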

