A Practical Exploration of Vision Transformers for Classification and Object Detection
Abstract
Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Networks (CNNs) by capturing global context in images through self-attention mechanisms. This study systematically examines the application of ViTs to image classification and object detection, evaluating the roles of tokenization, self-attention, and feedforward layers in real-world scenarios. Strategies including data augmentation and hyperparameter tuning were applied to proprietary datasets to improve model generalization. Furthermore, attention map visualizations increase interpretability by revealing which image regions drive model predictions. The study highlights ViTs' potential for improved classification accuracy, superior object detection, and explainable AI in deep learning applications.
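The three components named above (tokenization, self-attention, and feedforward layers) can be illustrated with a minimal numpy sketch of one ViT encoder step. This is not the paper's implementation; all shapes (32x32 RGB input, 8x8 patches, embedding dimension 16) and the ReLU nonlinearity are illustrative assumptions, and a real ViT adds positional embeddings, layer normalization, residual connections, and multiple heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch_size):
    """Tokenization step: split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)  # (num_patches, patch_dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention; attn is the attention map."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)          # (N, N): how much each token attends to every other
    return attn @ V, attn

def feedforward(x, W1, b1, W2, b2):
    """Position-wise MLP applied independently to each token (ReLU stands in for GELU)."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

# Toy forward pass: 32x32 RGB image, 8x8 patches -> 16 tokens, embedding dim 16.
image = rng.standard_normal((32, 32, 3))
patches = patchify(image, 8)                  # (16, 192)
W_embed = rng.standard_normal((192, 16)) * 0.02
tokens = patches @ W_embed                    # linear patch embedding: (16, 16)

d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
attended, attn_map = self_attention(tokens, Wq, Wk, Wv)

W1, b1 = rng.standard_normal((d, 4 * d)) * 0.02, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.02, np.zeros(d)
out = feedforward(attended, W1, b1, W2, b2)   # (16, 16)
```

The `attn_map` array is exactly what attention-map visualizations render: each row is a probability distribution over all patches, so plotting a row as a heatmap over the image shows which regions that token attends to.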