Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 23, December 2022

Language-Agnostic Text Processing for Information Extraction

  Authors

Karthika Vijayan and Oshin Anand, Sahaj AI, Bangalore, India

  Abstract

Information extraction from multilingual text for conversational AI typically implements natural language understanding (NLU) with multiple language-specific models, which may not be available for low-resource languages or code-mixed scenarios. In this paper, we study the implementation of multilingual NLU through the development of a language-agnostic processing pipeline. We perform this study using the case of a conversational assistant built with the RASA framework. Automatic assistants for answering text queries are built for different languages and for code-mixed text, and in the process we experiment with different components of the NLU pipeline. Sparse and dense feature extraction together provide the language-agnostic composite featurization of text in the pipeline. We perform experiments with intent classification and entity extraction as part of information extraction. The efficacy of the language-agnostic NLU pipeline is showcased (i) when dedicated language models are not available for all languages of interest, and (ii) in the case of code mixing. Our experiments delivered intent classification accuracies of 98.49%, 96.41% and 97.98% for the same queries in English, Hindi and Malayalam, respectively, without any dedicated language models.
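
As an illustration of what composite featurization can look like, the following is a minimal sketch, not the authors' pipeline. It assumes character n-gram counts as the sparse part and uses a hash-derived vector as a stand-in for the dense part, where a real system would use embeddings from a pretrained model. The function names (`char_ngrams`, `dense_vector`, `composite_features`), the vector size, and the three example queries are all hypothetical and chosen only for this sketch.

```python
import hashlib
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Sparse features: character n-gram counts (no language-specific tokenizer needed)."""
    padded = f" {text.strip().lower()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def dense_vector(text, dim=8):
    """Dense stand-in (assumption): average of hash-derived per-token vectors.

    A real pipeline would use pretrained embeddings here; dim must be <= 32
    because a SHA-256 digest has 32 bytes.
    """
    tokens = text.strip().lower().split()
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for j in range(dim):
            vec[j] += digest[j] / 255.0 - 0.5
    if tokens:
        vec = [v / len(tokens) for v in vec]
    return vec


def composite_features(text, vocab, dim=8):
    """Concatenate the sparse n-gram counts (ordered by vocab) with the dense vector."""
    sparse = char_ngrams(text)
    sparse_part = [float(sparse.get(gram, 0)) for gram in vocab]
    return sparse_part + dense_vector(text, dim)


# Made-up example queries in English, romanized Hindi, and Malayalam script.
queries = ["what is my balance", "mera balance kya hai", "എന്റെ ബാലൻസ് എത്രയാണ്"]
vocab = sorted({gram for q in queries for gram in char_ngrams(q)})
features = [composite_features(q, vocab) for q in queries]
```

Every query, whatever its language or script, maps to a vector of the same length (`len(vocab) + 8` here), so a single downstream intent classifier could consume all of them without a per-language model.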

  Keywords

Information Extraction, Multilingual Text, Natural Language Understanding, Language-Agnostic Processing, Composite Features.