Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 13, July 2022

A Comparison between VGG16 and Xception Models Used as Encoders for Image Captioning


Asrar Almogbil, Amjad Alghamdi, Arwa Alsahli, Jawaher Alotaibi, Razan Alajlan and Fadiah Alghamdi, Imam Abdulrahman Bin Faisal University, Saudi Arabia


Image captioning is an active research topic at the intersection of Natural Language Processing (NLP) and Computer Vision (CV). Current image captioning models are capable enough to be used for valuable tasks, but they demand a lot of computational power and storage space. Despite the importance of this problem, only a few studies have compared models with a view to preparing them for use on mobile devices. Furthermore, most of these studies focus on the decoder in an encoder-decoder architecture, even though the encoder usually takes up the majority of the space. This study provides a brief overview of image captioning advancements over the last five years, illustrates the prevalent techniques, and summarizes the results. It also discusses two commonly used models, VGG16 and Xception, as encoders, with Long Short-Term Memory (LSTM) used for text generation. The study was conducted on the Flickr8k dataset.


Image Captioning, Encoder-Decoder Framework, VGG16, Xception, LSTM.