Integration of OCR Technology with ETL Processes for Automating Data Pipeline of Financial Disbursement Documents at BPS Sukabumi Regency

Authors

Muhammad Raihan Izharul Haq, Gina Purnama Insany, Somantri

DOI:

https://doi.org/10.34288/jri.v7i4.395

Keywords:

Big Data, Optical Character Recognition (OCR), Extract Transform Load (ETL), Automated Data Pipeline, Financial Disbursement

Abstract

In the digital era, managing archival data poses challenges for many institutions, including Badan Pusat Statistik (BPS) of Sukabumi Regency, especially when dealing with unstructured PDF documents. This study develops a data pipeline that integrates Optical Character Recognition (OCR) technology with Extract, Transform, Load (ETL) processes. Unstructured data from financial disbursement documents, such as SPM (Surat Perintah Membayar, payment order) and SP2D (Surat Perintah Pencairan Dana, fund disbursement order), were extracted automatically with high accuracy: an average of 98.52% for SPM using a combination of OCR and PDFPlumber, and 100% for SP2D using PDFPlumber. The extraction results were stored in a data warehouse, then transformed using Apache Spark and loaded into data marts. The ETL process was automated using Apache Airflow, which ran each task reliably in dependency order. The processed data were presented in real time through an interactive Looker Studio dashboard, supporting efficient archive management and better-informed decision-making. This study not only provides a solution to existing archival management problems but also opens opportunities for further development in the application of big data technologies and business process automation in the public sector.
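
The extract, transform, load sequence described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, with `graphlib.TopologicalSorter` standing in for the dependency ordering that Apache Airflow provides, regular expressions standing in for the OCR and PDFPlumber extraction step, and an in-memory list standing in for the data mart. The sample text, field labels, and all function names are hypothetical and chosen purely for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical text, shaped like what a PDF text-extraction or OCR step
# might return. The labels and values are invented for illustration.
SAMPLE_TEXT = """
Nomor SPM: 00123/SPM/2024
Tanggal: 2024-03-15
Jumlah: Rp 12.500.000
"""


def extract(ctx):
    """Pull labelled fields out of the raw text with regular expressions."""
    patterns = {
        "number": r"Nomor SPM:\s*(\S+)",
        "date": r"Tanggal:\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Jumlah:\s*Rp\s*([\d.]+)",
    }
    ctx["raw"] = {
        field: re.search(pattern, ctx["text"]).group(1)
        for field, pattern in patterns.items()
    }


def transform(ctx):
    """Normalise types: strip dot thousands separators and cast to int."""
    row = dict(ctx["raw"])
    row["amount"] = int(row["amount"].replace(".", ""))
    ctx["row"] = row


def load(ctx):
    """Append the cleaned row to an in-memory stand-in for a data mart."""
    ctx["mart"].append(ctx["row"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline(text):
    """Run every task once, in an order that respects DEPENDENCIES."""
    ctx = {"text": text, "mart": []}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx["mart"]


if __name__ == "__main__":
    order, mart = run_pipeline(SAMPLE_TEXT)
    print(order)
    print(mart)
```

Declaring the dependencies as data, rather than hard-coding the call order, mirrors how a workflow scheduler decides what to run next: a task starts only after everything it depends on has finished.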

Published

2025-09-12

How to Cite

Muhammad Raihan Izharul Haq, Gina Purnama Insany, & Somantri. (2025). Integration of OCR Technology with ETL Processes for Automating Data Pipeline of Financial Disbursement Documents at BPS Sukabumi Regency. Jurnal Riset Informatika, 7(4), 317–326. https://doi.org/10.34288/jri.v7i4.395