TRANSFER LEARNING ARCHITECTURE SELECTION FOR REMOTE SENSING SCENE CLASSIFICATION

Akhiyar Waladi; Hasanatul Iftitah

doi:10.34288/jri.v8i3.515

Authors

Akhiyar Waladi Universitas Jambi
Hasanatul Iftitah Universitas Jambi

(*) Corresponding Author


Keywords:

Remote Sensing, Scene Classification, Transfer Learning, Vision Transformer, Benchmark Comparison

Abstract

Selecting a deep learning architecture for remote sensing scene classification usually means comparing accuracies published in papers that each use a different training protocol, so it is unclear whether an accuracy gap reflects the architecture or the training procedure. We isolate the architecture variable by evaluating eight models from three design families under identical training conditions: five classical CNNs (ResNet-50, ResNet-101, DenseNet-121, EfficientNet-B0, EfficientNet-B3), two vision transformers (ViT-B/16, Swin Transformer), and one modernized CNN (ConvNeXt-Tiny). The models are evaluated on EuroSAT (10 classes, 27,000 Sentinel-2 patches) and UC Merced (21 classes, 2,100 aerial photographs). Every model shares the same ImageNet-1K initialization, AdamW optimizer, augmentation pipeline, and early stopping rule. ConvNeXt-Tiny reached the highest accuracy on EuroSAT (99.11%) and Swin-T the highest on UC Merced (99.76%), but the gap between the best and worst model was only 0.41 percentage points on EuroSAT and 1.66 on UC Merced. McNemar's test indicated that most pairwise differences were not statistically significant. EfficientNet-B0, the smallest model at 4.0M parameters, reached 98.76% on EuroSAT and 99.52% on UC Merced while using 21x fewer parameters than ViT-B/16. On these two well-studied benchmarks, a single uniform training configuration was sufficient to bring all architectures to near-identical performance. This convergence, observed under one fixed protocol and a single data partition, suggests that on saturated classification tasks the choice of architecture may be secondary to the choice of training procedure. Whether the convergence holds on harder benchmarks, under architecture-specific optimal configurations, or with domain-specific pretraining remains to be tested.
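The pairwise comparison mentioned in the abstract can be illustrated with a minimal sketch of McNemar's test on two models' predictions for the same test set. The abstract does not say which variant of the test the authors used; the exact binomial form below is an assumption, and the function name `mcnemar_exact` and the toy labels are hypothetical, chosen only for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples that model A
    classifies correctly and model B does not, and c counts the reverse.
    Samples on which both models agree in correctness are ignored.
    """
    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        a_ok = a == truth
        b_ok = other == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example with hypothetical labels (not data from the paper).
y_true = [1, 1, 1, 1, 1, 0]
pred_a = [1, 1, 1, 1, 1, 0]   # model A: all correct
pred_b = [0, 0, 0, 0, 0, 0]   # model B: wrong on the first five
b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p)
```

Because the test uses only the samples on which the two models disagree, two models with near-identical accuracy on a small test set produce few discordant pairs, and a difference of a fraction of a percentage point is then hard to distinguish from chance.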

