Machine Learning for Stroke Prediction: Evaluating the Effectiveness of Data Balancing Approaches
DOI:
https://doi.org/10.34288/jri.v6i4.344
Keywords:
Stroke Prediction, Data Balancing Technique, Artificial Intelligence, Classification, Imbalanced Data
Abstract
Stroke occurs when blood flow to the brain is disrupted, either by a blood clot (ischemic) or by a ruptured blood vessel (hemorrhagic), leading to brain tissue damage and neurological dysfunction. It remains a leading cause of death and disability worldwide, making early prediction crucial for timely intervention. This study evaluates the impact of data balancing techniques on stroke prediction performance across different machine learning models. Random Forest (RF) consistently achieves the highest accuracy (98%), but its precision and recall vary with the balancing method. Decision Tree (DT) and K-Nearest Neighbors (KNN) benefit most from SMOTE and SMOTETomek, which improve their F1-scores (11.21% and 9.18%), indicating a better balance between precision and recall. Random Under Sampling enhances recall across all models but reduces precision, leading to lower overall predictive reliability. SMOTE and SMOTETomek emerge as the most effective balancing techniques, particularly for DT and KNN, while RF remains the most accurate but requires further optimization to improve its balance of precision and recall.
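The contrast the abstract draws between accuracy and the precision/recall/F1 metrics, and the idea behind the balancing techniques it names, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy labels, the function names, and the simplified SMOTE-style interpolation are not the study's data, code, or exact algorithms. It uses only the Python standard library.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def random_under_sample(X, y, seed=0):
    """Randomly keep as many rows of each class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n_min):
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


def smote_like_oversample(X_min, n_new, k=1, seed=0):
    """Simplified SMOTE-style step: interpolate between a minority point
    and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        others = [p for p in X_min if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy labels: 98 non-stroke (0) and 2 stroke (1) cases.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Balance the toy data by under-sampling the majority class.
X = [[float(i)] for i in range(100)]
X_bal, y_bal = random_under_sample(X, y_true)
balanced_counts = Counter(y_bal)

# Create synthetic minority points between two hypothetical minority rows.
minority = [[1.0, 1.0], [3.0, 5.0]]
synthetic = smote_like_oversample(minority, n_new=4)
```

In this toy setting the majority-only classifier reaches high accuracy while its F1-score for the minority class is zero, which is why the abstract reports F1, precision, and recall alongside accuracy. Under-sampling discards majority rows, whereas the SMOTE-style step adds interpolated minority rows.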
License
Copyright (c) 2024 Muhamad Indra, Siti Ernawati, Ilham Maulana

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The Jurnal Riset Informatika has legal rules for accessing digital electronic articles under a Creative Commons Attribution-NonCommercial 4.0 International License. Articles published in Jurnal Riset Informatika are provided as Open Access for the purpose of scientific development, research, and libraries.