Volume 16, Number 3
A Thorough Introduction to Multimodal Machine Translation
Author
Kouassi Konan Jean-Claude, Bircham International University, Spain
Abstract
Five years before the release of ChatGPT, the field of Machine Translation (MT) was dominated by unimodal AI implementations, generally bilingual or multilingual models operating on text alone. The era of Large Language Models (LLMs) led to various multimodal translation initiatives combining text and image modalities, based on custom data engineering techniques that raised expectations of improvement in MT through multimodal approaches. In our work, we introduced a first-of-its-kind multimodal AI translation system with four modalities (text, image, audio, and video), from English to a low-resource language and vice versa. Our results confirmed that multimodal translation generalizes better, consistently improves over unimodal text-only translation, and delivers superior performance as the number of unseen samples increases. Moreover, this initiative offers hope for low-resource languages worldwide, for which the use of non-text modalities is a practical solution to data scarcity in the field.
Keywords
Artificial Intelligence (AI), Machine Learning (ML), Multimodal Machine Translation, Data Engineering, Multimodal Dataset, Baoulé language (bci)