Academy & Industry Research Collaboration Center (AIRCC)

Volume 11, Number 14, September 2021

Creating Multi-Scripts Sentiment Analysis Lexicons for Algerian, Moroccan and Tunisian Dialects

  Authors

K. Abidi and K. Smaili, Loria - University of Lorraine, France

  Abstract

In this article, we tackle the issue of sentiment analysis in three Maghrebi dialects used in social networks. More precisely, we are interested in analysing sentiments in Algerian, Moroccan and Tunisian corpora. To do this, we automatically built three sentiment lexicons, one for each dialect. Each lexicon is composed of words with their polarities; a dialect word may be written in Arabic or in Latin script. These lexicons may include French or English words as well as words in Arabic dialect and standard Arabic. The semantic orientation of a word, represented by an embedding vector, is determined automatically by computing its distance to the embeddings of several seed words. The embedding vectors are trained on three large corpora collected from YouTube. The proposed approach is evaluated on a few existing annotated corpora in the Tunisian and Moroccan dialects. For the Algerian dialect, in addition to a small corpus we found in the literature, we collected and annotated a corpus of 10k comments extracted from YouTube. This corpus is a valuable resource that we make freely available.
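
The seed-based scoring described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the choices here are assumptions: cosine similarity as the closeness measure, and a score equal to the mean similarity to positive seeds minus the mean similarity to negative seeds. The 3-dimensional vectors, the seed words and the function names are hypothetical toy values, not taken from the trained embeddings.

```python
import numpy as np

# Toy 3-d embeddings (hypothetical values, for illustration only).
# Real vectors would come from embeddings trained on the YouTube corpora.
EMBEDDINGS = {
    "good":     np.array([1.0, 0.1, 0.0]),
    "bien":     np.array([0.9, 0.2, 0.1]),
    "bad":      np.array([-1.0, 0.1, 0.0]),
    "nul":      np.array([-0.8, 0.0, 0.2]),
    "super":    np.array([0.8, 0.3, 0.0]),
    "horrible": np.array([-0.7, 0.2, 0.1]),
}

POSITIVE_SEEDS = ["good", "bien"]  # assumed seed lists
NEGATIVE_SEEDS = ["bad", "nul"]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_orientation(word, embeddings, pos_seeds, neg_seeds):
    """Mean similarity to positive seeds minus mean similarity to negative seeds."""
    vec = embeddings[word]
    pos = np.mean([cosine(vec, embeddings[s]) for s in pos_seeds])
    neg = np.mean([cosine(vec, embeddings[s]) for s in neg_seeds])
    return float(pos - neg)


def polarity(score, threshold=0.0):
    """Map a score to a label; the threshold is an assumed parameter."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def build_lexicon(words, embeddings, pos_seeds, neg_seeds):
    """Assign a polarity label to every candidate word."""
    return {
        w: polarity(semantic_orientation(w, embeddings, pos_seeds, neg_seeds))
        for w in words
    }


lexicon = build_lexicon(["super", "horrible"], EMBEDDINGS,
                        POSITIVE_SEEDS, NEGATIVE_SEEDS)
print(lexicon)
```

Because the lexicons mix Arabic-script, Latin-script, French and English tokens, the same scoring applies to any token that has an embedding; only the seed lists would change per dialect.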

  Keywords

Maghrebi Dialect, Word Embedding, Semantic Orientation.