Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 07, April 2022

Multilingual Speech Recognition Methods using Deep Learning and Cosine Similarity


P Deepak Reddy, Chirag Rudresh and Adithya A S, PES University, India


This paper investigates new methods for multilingual speech recognition and compares the effectiveness of existing solutions with the proposed novel approaches. The audio and textual dataset consists of multilingual sentences, each containing words from two languages: English and Kannada.

Our proposed speech recognition process preprocesses each audio sentence and splits it into words, which are then given as input to the DL translator (using MFCC features) along with next-word predictions. A Next Word Prediction model is used alongside the DL translator to identify the words accurately and convert them to text. The second proposed approach uses cosine similarity, where recognition is based on the similarity between the uttered word and the generated training dataset. Our models were trained on an audio and textual dataset generated by the team members, and test accuracies were measured on the same dataset.
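
The cosine similarity approach can be illustrated with a minimal sketch. It assumes each word-level audio segment has already been reduced to a fixed-length feature vector (for example, MFCC coefficients averaged over time) and that the training dataset is available as labeled vectors. The function names, the toy three-dimensional vectors, and the word labels are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def recognize_word(features, templates):
    """Return (label, score) of the training template most similar to `features`.

    `templates` is a list of (label, feature_vector) pairs.
    Returns (None, -1.0) when `templates` is empty.
    """
    best_label, best_score = None, -1.0
    for label, template in templates:
        score = cosine_similarity(features, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy vectors standing in for real per-word audio features (hypothetical data).
templates = [
    ("hello", [1.0, 0.0, 0.0]),      # English template
    ("namaskara", [0.0, 1.0, 0.0]),  # Kannada template, romanized
]

label, score = recognize_word([0.1, 0.9, 0.0], templates)
print(label, round(score, 3))
```

Because cosine similarity depends only on the direction of the vectors, this kind of matching ignores overall magnitude differences such as loudness, but it requires every utterance to be mapped to a vector of the same length.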

The accuracy of our speech recognition model, using the novel method, is 71%. This is a good result compared to existing multilingual translation solutions.

The communication gap has been a major issue for many natives and locals trying to learn or move ahead in this tech-savvy, English-speaking world. To communicate effectively, it is essential to have not only a single-language translator but also a tool that can understand a mixture of languages, bridging the communication gap with non-English-speaking communities. Integrating a multilingual translator with a smartphone voice assistant can aid this process.


Natural Language Processing, Deep Learning, Multilingual Speech Recognition, Machine Learning, Speech to Text.