Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 06, March 2022

Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

  Authors

Yahya Aseri, Khalid Alreemy, Salem Alelyani, Mohamed Mohana, King Khalid University, Saudi Arabia

  Abstract

Dialect identification is a prerequisite for learning the lexical and morphological knowledge of a language variety, which can benefit natural language processing (NLP) and downstream AI tasks. In this paper, we present the first work on sentence-level Modern Standard Arabic (MSA) and Saudi Dialect (SD) identification. We trained and tested three classifiers (Logistic Regression, Multinomial Naïve Bayes, and Support Vector Machine) on datasets collected from Saudi Twitter and automatically labeled as MSA or SD. The model for each configuration was built using two levels of language models, i.e., unigram and bigram, as feature sets for training the systems. The models achieved high accuracy under 10-fold cross-validation, averaging 98.98%. They were then evaluated on a separate, unseen, manually annotated dataset. The best performance on this dataset was achieved by Multinomial Naïve Bayes, reporting 89%.

  Keywords

Dialect Identification, NLP, Standard Arabic, Saudi Dialect, Classification.
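
  Illustrative sketch

The abstract describes classifiers trained on unigram and bigram features. The following is a minimal, standard-library-only sketch of that idea for one of the three classifiers (Multinomial Naïve Bayes with Laplace smoothing). It is not the authors' implementation: the names ngram_features and MultinomialNB, the whitespace tokenizer, the smoothing constant alpha, and the four toy sentences are all assumptions made for illustration, and the 10-fold cross-validation step is omitted.

```python
import math
from collections import Counter, defaultdict


def ngram_features(sentence):
    """Return unigrams followed by bigrams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes over n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            feats = ngram_features(sentence)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(self.class_counts[label] / total_docs)
            for feat in ngram_features(sentence):
                if feat in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy sentences invented for illustration; they are NOT from the paper's data.
train_sentences = ["ماذا تريد الآن", "أين أنت الآن", "وش تبي الحين", "وينك الحين"]
train_labels = ["MSA", "MSA", "SD", "SD"]

model = MultinomialNB().fit(train_sentences, train_labels)
prediction_sd = model.predict("وش تبي")
prediction_msa = model.predict("ماذا تريد")
```

With this toy training set, prediction_sd is "SD" and prediction_msa is "MSA", because each test sentence shares its unigrams and its bigram with only one class. A realistic setup would need a proper Arabic tokenizer and normalization, and would compare this classifier against logistic regression and a support vector machine as the abstract describes.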