AE-ViT: Token Enhancement for Vision Transformers via CNN-Based Autoencoder Ensembles
Authors
Heriniaina Andry RABOANARY, Roland RABOANARY, and Nirina Maurice HASINA TAHIRIDIMBISOA, Université d'Antananarivo, Madagascar
Abstract
While Vision Transformers (ViTs) have revolutionized computer vision with their exceptional results, they struggle to balance processing speed against visual detail preservation. This tension becomes particularly evident at larger patch sizes: although larger patches reduce computational cost, they incur significant information loss during tokenization. We present AE-ViT, a novel architecture that addresses this issue with an ensemble of autoencoders whose specialized latent tokens integrate seamlessly with standard patch tokens, enabling ViTs to capture both global and fine-grained features. Our experiments on CIFAR-100 show that AE-ViT achieves a 23.67% relative accuracy improvement over the baseline ViT when using 16×16 patches, effectively recovering fine-grained details typically lost at larger patch sizes. Notably, AE-ViT maintains competitive performance (60.64%) even at 32×32 patches. We further validate our method on CIFAR-10, confirming consistent benefits across datasets. Ablation studies on ensemble size and integration strategy underscore the robustness of AE-ViT, while computational analysis shows that its efficiency scales favorably with increasing patch size. Overall, these findings suggest that AE-ViT offers a practical solution to the patch-size dilemma in ViTs, striking a balance between accuracy and computational cost within a simple, end-to-end trainable design.
Keywords
Vision Transformers, Convolutional Neural Networks, Autoencoders, Hybrid Architecture, Image Classification, Latent Representation.
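
The abstract's core mechanism, an ensemble of convolutional autoencoders whose bottleneck vectors become extra transformer tokens alongside the usual patch tokens, can be made concrete with a short sketch. The PyTorch code below is an illustrative reading of that description, not the authors' implementation: the module names (ConvAutoencoder, AEViTSketch), the ensemble size of four, and all layer dimensions are assumptions chosen for a 32×32 CIFAR-style input.

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    # Small CNN autoencoder; its bottleneck supplies one latent token.
    # Sizes are illustrative assumptions for 3x32x32 inputs.
    def __init__(self, in_ch=3, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),     # 16 -> 8
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # 8x8 -> 1x1
            nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(  # kept for a reconstruction objective
            nn.Linear(latent_dim, 32 * 8 * 8),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),     # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(16, in_ch, 4, stride=2, padding=1),  # 16 -> 32
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class AEViTSketch(nn.Module):
    # Standard ViT patch tokens plus one latent token per autoencoder,
    # all processed jointly by a transformer encoder.
    def __init__(self, img_size=32, patch=16, dim=192, n_ae=4, n_classes=100):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.aes = nn.ModuleList([ConvAutoencoder(latent_dim=64) for _ in range(n_ae)])
        self.latent_proj = nn.Linear(64, dim)  # map AE latents into token space
        n_patches = (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, 1 + n_patches + n_ae, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        b = x.size(0)
        patch_tok = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        latents, recons = zip(*(ae(x) for ae in self.aes))
        latent_tok = self.latent_proj(torch.stack(latents, dim=1))   # (B, n_ae, dim)
        tokens = torch.cat([self.cls.expand(b, -1, -1), patch_tok, latent_tok], dim=1)
        out = self.encoder(tokens + self.pos)
        return self.head(out[:, 0]), recons  # class logits + reconstructions

model = AEViTSketch()
logits, recons = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 100])

Since the design is described as end-to-end trainable, a full training loop would presumably combine the classification loss on the logits with a reconstruction loss on the autoencoder outputs; the weighting of the two terms is not specified here and would be a tunable choice.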