Comparative Analysis of CNN and ViT Architectures with Augmentation Variations for Mandarin Character Classification


Stefanus Eko Prasetyo
Haeruddin
Wilsen Lau

Abstract

Handwritten Chinese Character Recognition (HCCR) is a fundamental computer vision task that poses significant challenges due to the large number of character classes and the high variability of writing styles. This study performs a systematic comparative evaluation of a Convolutional Neural Network (ResNet-50) and a Vision Transformer (ViT-B/16) on these challenges under limited-data conditions. The experiments use a subset of the CASIA-HWDB1.1 dataset consisting of 10 fine-grained character classes selected for their visual similarity. To ensure robust evaluation, 5-fold cross-validation was applied across five experimental scenarios (Baseline, Geometric, Elastic Transform, Random Erasing, and CutMix). The results show that ResNet-50 consistently outperformed ViT-B/16 in both accuracy and stability: the best overall performance was achieved by ResNet-50 with Elastic Transform augmentation (90.52% test accuracy), while the best ViT-B/16 result was obtained with Random Erasing (88.69%). ViT also exhibited higher performance variability than the CNN. These findings indicate that for HCCR with limited data, CNNs possess a superior inductive bias for capturing local stroke features, and that data augmentation must be tailored to the architecture type to maximize performance.
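To make the experimental setup concrete, the sketch below shows how the five augmentation scenarios and the two backbones named in the abstract could be expressed with PyTorch/torchvision. This is a minimal illustrative sketch, not the authors' code: the input size, affine ranges, elastic parameters, erasing probability, and the use of ImageNet-pretrained weights are all assumptions made for illustration.

    # Hypothetical sketch (not the authors' code) of the five augmentation
    # scenarios from the abstract, plus the two backbones under comparison.
    import torch.nn as nn
    from torchvision import transforms, models

    IMG_SIZE = 224      # assumed input resolution for both backbones
    NUM_CLASSES = 10    # 10 visually similar character classes (per the abstract)

    # HWDB character scans are grayscale; replicating to 3 channels is an
    # assumed preprocessing step for ImageNet-pretrained backbones.
    pre = [transforms.Grayscale(num_output_channels=3),
           transforms.Resize((IMG_SIZE, IMG_SIZE))]
    post = [transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet stats
                                 std=[0.229, 0.224, 0.225])]

    scenarios = {
        # Baseline: resize and normalize only.
        "baseline": transforms.Compose(pre + post),
        # Geometric: small random rotations and translations (ranges assumed).
        "geometric": transforms.Compose(
            pre + [transforms.RandomAffine(degrees=10, translate=(0.1, 0.1))] + post),
        # Elastic Transform: non-rigid stroke distortion
        # (transforms.ElasticTransform is available in recent torchvision releases).
        "elastic": transforms.Compose(
            pre + [transforms.ElasticTransform(alpha=50.0, sigma=5.0)] + post),
        # Random Erasing operates on tensors, so it follows ToTensor().
        "random_erasing": transforms.Compose(
            pre + post + [transforms.RandomErasing(p=0.5)]),
        # CutMix is a batch-level operation rather than a per-sample transform;
        # see the note after this listing.
    }

    def build_model(name: str) -> nn.Module:
        """Instantiate one of the two compared backbones with a 10-class head."""
        if name == "resnet50":
            m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
            m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
        elif name == "vit_b_16":
            m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
            m.heads.head = nn.Linear(m.heads.head.in_features, NUM_CLASSES)
        else:
            raise ValueError(f"unknown backbone: {name}")
        return m

The fifth scenario, CutMix, mixes pairs of images and their labels within each training batch; in torchvision 0.16+ it is available as torchvision.transforms.v2.CutMix(num_classes=...) and is applied to (images, labels) batches inside the training loop. The 5-fold cross-validation described in the abstract would then wrap this setup, for example with sklearn.model_selection.StratifiedKFold(n_splits=5), training each architecture-scenario pair once per fold.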


Article Details

How to Cite
Prasetyo, S. E., Haeruddin, & Lau, W. (2026). Analisis Komparatif Arsitektur CNN dan ViT dengan Variasi Augmentasi untuk Klasifikasi Karakter Mandarin. Bitnet: Jurnal Pendidikan Teknologi Informasi, 11(1). https://doi.org/10.33084/bitnet.v11i1.11930
