Academy & Industry Research Collaboration Center (AIRCC)

Volume 11, Number 03, March 2021

Investigating Data Sharing in Speech Recognition for an Under-Resourced
Language: The Case of Algerian Dialect

  Authors

Mohamed Amine Menacer and Kamel Smaïli, Université de Lorraine, France

  Abstract

The Arabic language has many varieties: its standard form, Modern Standard Arabic (MSA), and its spoken forms, the dialects. These dialects are representative examples of under-resourced languages, for which automatic speech recognition remains an unresolved issue. To address this problem, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. This model was then improved by exploiting other languages that influence the dialect: their data were pooled into one large corpus, and three approaches were investigated, namely multilingual training, multitask learning and transfer learning. The best performance was achieved using a limited and balanced amount of acoustic data from each additional language, relative to the amount of data available for the studied dialect. This approach yielded an improvement of 3.8% in word error rate over the baseline system trained only on the dialect data.
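
As a rough illustration of the transfer-learning setup summarised above, the sketch below pre-trains a toy acoustic model on a balanced subset of pooled data from additional languages and then fine-tunes it on dialect data only. It is a minimal PyTorch sketch under stated assumptions, not the authors' system: the architecture, feature dimensions, language names and the balanced_subset / fake_corpus helpers are illustrative.

```python
# Hypothetical sketch (not the paper's code): pre-train on a balanced,
# limited pool of extra-language data, then fine-tune on the dialect.
import random
import torch
import torch.nn as nn

def balanced_subset(corpora, per_language):
    """Take the same limited number of utterances from each extra language."""
    pooled = []
    for utterances in corpora.values():
        pooled.extend(random.sample(utterances, min(per_language, len(utterances))))
    return pooled

class AcousticModel(nn.Module):
    """Tiny BLSTM mapping filterbank frames to per-frame phone posteriors."""
    def __init__(self, n_feats=40, n_units=128, n_phones=40):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, n_units, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_units, n_phones)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

def train(model, data, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in data:
            opt.zero_grad()
            logits = model(feats)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            loss.backward()
            opt.step()

# Toy stand-in data: (features, frame labels) pairs per language.
def fake_corpus(n_utts):
    return [(torch.randn(1, 50, 40), torch.randint(0, 40, (1, 50)))
            for _ in range(n_utts)]

extra_languages = {"MSA": fake_corpus(30), "French": fake_corpus(30)}
dialect = fake_corpus(20)

model = AcousticModel()
# 1) Pre-train on a limited, balanced amount of data per extra language.
train(model, balanced_subset(extra_languages, per_language=20))
# 2) Fine-tune (transfer) on the Algerian dialect data only.
train(model, dialect, epochs=2)
```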

  Keywords

Automatic speech recognition, Algerian dialect, MSA, multilingual training, multitask learning, transfer learning.