Jason S. Chu¹ and Sindhu Ghanta², ¹USA, ²AIClub, USA
This paper explores multimodal sentiment analysis, a field whose significance has grown with the exponential rise of multimodal data on platforms such as YouTube. Traditional sentiment analysis, which focuses primarily on textual data, often overlooks the complexities and nuances of human emotion conveyed through audio and visual cues. To address this gap, our study develops a comprehensive approach that integrates text, audio, and image data, applying state-of-the-art machine learning and deep learning techniques tailored to each modality. Our methodology is evaluated on the CMU-MOSEI dataset, a multimodal collection drawn from YouTube that covers a diverse range of human sentiments. Our research highlights the limitations of conventional text-based sentiment analysis, especially for the intricate expressions of sentiment that multimodal data encapsulates. By fusing audio and visual information with textual analysis, we aim to capture a more complete spectrum of human emotions. Our experimental results show notable improvements in precision, recall, and accuracy for emotion prediction, validating the efficacy of our multimodal approach over single-modality methods. This study not only contributes to ongoing advances in sentiment analysis but also underscores the potential of multimodal approaches to provide more accurate and nuanced interpretations of human emotions.
Keywords: Transformers, multimodal, sentiment analysis.