IMPROVING IMAGE CLASSIFICATION ACCURACY WITH OVERSAMPLING AND DATA AUGMENTATION USING DEEP LEARNING: A CASE STUDY ON THE SIMPSONS CHARACTERS DATASET
DOI: https://doi.org/10.34288/jri.v6i4.348
Keywords: Imbalanced Data, Oversampling, Data Augmentation, Convolutional Neural Network (CNN)
Abstract
Class imbalance in image classification often prevents deep learning models from making accurate predictions, especially for minority classes. This study introduces AugOS-CNN (Augmentation and Over Sampling with CNN), an approach that combines oversampling and data augmentation to address this imbalance. The study uses the Simpsons Characters dataset, restricted to five character classes: Bart, Homer, Agnes, Carl, and Apu. Each class is balanced to 2,067 samples using an augmentation method based on Augmentor. The proposed model integrates the oversampling and augmentation steps with a Convolutional Neural Network (CNN) architecture to improve classification accuracy. Evaluation results show that the AugOS-CNN model achieves the highest accuracy, 96%, outperforming the baseline CNN without data balancing techniques, which reaches only 91%. These findings indicate that AugOS-CNN improves image classification performance on datasets with imbalanced class distributions and contributes to the development of more robust deep learning methods for handling data imbalance.
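The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract names Augmentor, whereas this sketch substitutes a plain horizontal flip on images stored as nested lists so that it runs with the standard library alone. The function names, the toy data, and the choice to truncate classes that already exceed the target are all assumptions made for illustration; only the per-class target of 2,067 comes from the abstract.

```python
import random

TARGET_PER_CLASS = 2067  # per-class size reported in the abstract


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows (stand-in augmentation)."""
    return [list(reversed(row)) for row in image]


def balance_class(images, target, augment, rng):
    """Return exactly `target` images: the originals plus augmented copies."""
    if not images:
        raise ValueError("cannot oversample an empty class")
    if len(images) >= target:
        # Assumption: oversized classes are truncated; the abstract does not
        # say how classes above the target are handled.
        return list(images[:target])
    balanced = list(images)
    while len(balanced) < target:
        source = rng.choice(images)       # oversampling: draw with replacement
        balanced.append(augment(source))  # augmentation: perturb the drawn image
    return balanced


def balance_dataset(dataset, target=TARGET_PER_CLASS, augment=horizontal_flip, seed=0):
    """Balance every class in a {label: [image, ...]} mapping to `target` images."""
    rng = random.Random(seed)
    return {
        label: balance_class(images, target, augment, rng)
        for label, images in dataset.items()
    }


# Hypothetical toy data: 1x2 "images", with class sizes chosen for illustration.
toy = {
    "homer": [[[i, i + 1]] for i in range(5)],
    "agnes": [[[i, i + 1]] for i in range(2)],
}
balanced = balance_dataset(toy, target=5)
print({label: len(images) for label, images in balanced.items()})
```

In a real experiment the balancing would typically be applied to the training split only, so that augmented copies of an image never leak into the evaluation data.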
License
Copyright (c) 2024 Ilham Maulana, Siti Ernawati, Muhammad Indra

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Jurnal Riset Informatika governs access to its digital articles under a Creative Commons Attribution-NonCommercial 4.0 International License. Articles published in Jurnal Riset Informatika are provided as Open Access for the purposes of scientific development, research, and libraries.