ANALYSIS OF CLASSIFICATION ALGORITHM IN UNBALANCED DIABETES DATASET
DOI: https://doi.org/10.34288/jri.v8i1.458
Keywords: Imbalanced Classification, Classification Algorithm, Over Sampling, Under Sampling, Diabetes
Abstract
Diabetes mellitus is a metabolic disease that is spreading rapidly worldwide and can be life-threatening. It occurs when the body's ability to process glucose declines, triggering metabolic disorders. Machine learning algorithms offer one effective approach to predicting or detecting diabetes based on the severity of a patient's symptoms. This study uses the Diabetes dataset from Kaggle and compares the performance of several classification algorithms on the original unbalanced data and after balancing the data with the SMOTE, Random Under Sampling, Random Over Sampling, and Near Miss resampling techniques. The results show that model performance is strongly influenced by the class balance of the data and by the resampling method used. On the original unbalanced data, Artificial Neural Network (ANN) gave the best results with the highest accuracy of 96.98%, suggesting that ANN adapted best to class imbalance among the algorithms compared. After resampling, the performance pattern changed: with SMOTE, Random Under Sampling, and Random Over Sampling, the Random Forest algorithm consistently produced the highest accuracies of 96.52%, 89.84%, and 96.26%, respectively, demonstrating its strength in exploiting balanced data. With the Near Miss method, the best performance was achieved by Logistic Regression with an accuracy of 94.41%, indicating that selecting majority-class samples based on their proximity to minority-class samples is better suited to linear models. Therefore, selecting the right combination of resampling method and machine learning algorithm is an important factor in obtaining optimal diabetes predictions.
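To make the two random resampling techniques named in the abstract concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names, the toy feature rows, and the 90/10 class split are all hypothetical and chosen for illustration, and SMOTE and Near Miss are not covered here.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_under_sample(rows, labels, seed=0):
    """Keep a random subset of the larger classes
    so every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 90 non-diabetic (0) and 10 diabetic (1) records,
# each with a single made-up feature value.
toy_rows = [[float(i)] for i in range(100)]
toy_labels = [0] * 90 + [1] * 10

_, over_labels = random_over_sample(toy_rows, toy_labels)
_, under_labels = random_under_sample(toy_rows, toy_labels)

print("original:     ", dict(Counter(toy_labels)))
print("over-sampled: ", dict(Counter(over_labels)))
print("under-sampled:", dict(Counter(under_labels)))
```

In a setup like this, resampling is normally applied to the training split only, so that the test set keeps the original class distribution. Under sampling discards most majority-class records, which may be one reason a classifier scores lower after Random Under Sampling than after over sampling.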
License
Copyright (c) 2025 Ahmad Rifa'i, Herin Dwibima Aprianto, Lubna

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Jurnal Riset Informatika governs access to its digital electronic articles under a Creative Commons Attribution-NonCommercial 4.0 International License. Articles published in Jurnal Riset Informatika are Open Access, for the purposes of scientific development, research, and libraries.










