A Practical Exploration of Vision Transformers for Classification and Object Detection
Abstract
Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Networks (CNNs) by capturing global context in images through self-attention mechanisms. This study systematically examines the application of ViTs to image classification and object detection, evaluating the roles of tokenization, self-attention, and feedforward layers in real-world scenarios. Strategies including data augmentation and hyperparameter tuning were applied to proprietary datasets to improve model generalization. Furthermore, attention map visualizations increase interpretability by revealing which image regions drive model predictions. The study highlights ViTs' potential for improved classification accuracy, superior object detection, and explainable AI in deep learning applications.
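The three components named above (tokenization, self-attention, and feedforward layers) can be illustrated with a minimal numpy sketch of one ViT encoder step. This is not the paper's implementation; all shapes (32x32 RGB input, 8x8 patches, embedding dimension 16) and the ReLU nonlinearity are illustrative assumptions, and a real ViT adds positional embeddings, layer normalization, residual connections, and multiple heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch_size):
    """Tokenization step: split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)  # (num_patches, patch_dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention; attn is the attention map."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)          # (N, N): how much each token attends to every other
    return attn @ V, attn

def feedforward(x, W1, b1, W2, b2):
    """Position-wise MLP applied independently to each token (ReLU stands in for GELU)."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

# Toy forward pass: 32x32 RGB image, 8x8 patches -> 16 tokens, embedding dim 16.
image = rng.standard_normal((32, 32, 3))
patches = patchify(image, 8)                  # (16, 192)
W_embed = rng.standard_normal((192, 16)) * 0.02
tokens = patches @ W_embed                    # linear patch embedding: (16, 16)

d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
attended, attn_map = self_attention(tokens, Wq, Wk, Wv)

W1, b1 = rng.standard_normal((d, 4 * d)) * 0.02, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.02, np.zeros(d)
out = feedforward(attended, W1, b1, W2, b2)   # (16, 16)
```

The `attn_map` array is exactly what attention-map visualizations render: each row is a probability distribution over all patches, so plotting a row as a heatmap over the image shows which regions that token attends to.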