IMPROVING IMAGE CLASSIFICATION ACCURACY WITH OVERSAMPLING AND DATA AUGMENTATION USING DEEP LEARNING: A CASE STUDY ON THE SIMPSONS CHARACTERS DATASET

Ilham Maulana; Siti Ernawati; Muhammad Indra

doi:10.34288/jri.v6i4.348

Authors

Ilham Maulana Universitas Nusa Mandiri
Siti Ernawati Universitas Nusa Mandiri
Muhammad Indra Universitas Nusa Mandiri

(*) Corresponding Author

Keywords:

Imbalanced Data, Oversampling, Data Augmentation, Convolutional Neural Network (CNN)

Abstract

Class imbalance in image classification often prevents deep learning models from making accurate predictions, especially for minority classes. This study introduces AugOS-CNN (Augmentation and Over Sampling with CNN), a novel approach that combines oversampling and data augmentation to address class imbalance. The study uses the Simpsons Characters dataset with five character classes: Bart, Homer, Agnes, Carl, and Apu. Each class is balanced to 2,067 samples using an augmentation method based on Augmentor. The proposed model integrates the oversampling and augmentation steps with a Convolutional Neural Network (CNN) architecture to improve classification accuracy. In the evaluation, AugOS-CNN achieves the highest accuracy, 96%, compared with 91% for the baseline CNN trained without data balancing. These findings indicate that AugOS-CNN improves image classification performance on datasets with imbalanced class distributions and contributes to the development of more robust deep learning methods for imbalanced data.
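
The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the study uses Augmentor, whereas this sketch substitutes a plain horizontal flip so that it runs with the Python standard library alone. The function names, the toy images, and the toy class sizes are hypothetical; only the target of 2,067 samples per class comes from the abstract.

```python
import random

TARGET = 2067  # per-class size reported in the abstract


def hflip(image):
    """Mirror an image, given as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def balance_class(images, target, rng):
    """Bring one class to `target` images.

    Assumption: classes larger than `target` are truncated, and smaller
    classes are topped up with flipped copies of randomly chosen originals.
    """
    if not images:
        raise ValueError("cannot oversample an empty class")
    balanced = list(images[:target])
    while len(balanced) < target:
        balanced.append(hflip(rng.choice(images)))
    return balanced


def balance_dataset(dataset, target=TARGET, seed=0):
    """Apply balance_class to every label in a {label: [images]} mapping."""
    rng = random.Random(seed)
    return {label: balance_class(imgs, target, rng)
            for label, imgs in dataset.items()}


# Hypothetical toy data: 2x2 grayscale images; the counts are placeholders,
# not the real class sizes of the dataset.
toy = {
    "bart": [[[i, i + 1], [i + 2, i + 3]] for i in range(5)],
    "agnes": [[[i, 0], [0, i]] for i in range(2)],
}
balanced = balance_dataset(toy, target=5)
sizes = {label: len(imgs) for label, imgs in balanced.items()}
```

In this sketch every class ends up with the same number of images, so a CNN trained on `balanced` sees each character equally often. A real pipeline would typically use a wider set of transformations (rotation, zoom, distortion) than the single flip assumed here.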
