From Voice to Code: A RAG-Enhanced Pipeline for Robust Multi-Accent Order Processing

Authors

Amirmohammad Erfan¹, Taha Khan¹, Pelin Angin Ulkuer and Merih Angin²; ¹Middle East Technical University, Turkey; ²Koc University, Turkey

Abstract

The rapid evolution of large language models presents a significant new opportunity for human-AI interaction, particularly through automatic speech recognition (ASR). Despite advances in ASR, accent differences, noisy environments and diverse speech patterns still limit accuracy in tasks such as spoken order processing in restaurants. This paper introduces and assesses a complete pipeline that transcribes multi-accent spoken orders and structures them as JSON, maintaining performance even in noisy settings. Our system integrates the Whisper ASR model for voice transcription with two instruction-tuned language models, FLAN-T5 and Gemma-3, for text-to-JSON conversion. To train and test these models, we created a large-scale, diverse dataset of spoken orders featuring multiple accents and various background noises. We investigate a Retrieval-Augmented Generation (RAG) approach that aims to improve JSON conversion accuracy by providing the models with relevant menu context during inference. We evaluate the full pipeline on both clean and noisy audio, comparing fine-tuned FLAN-T5 and Gemma-3 with and without RAG. We also assess the models' ability to generalize to orders of varying complexity and their robustness to diverse speech patterns. Our results demonstrate that the proposed pipelines achieve high accuracy, with the RAG-enhanced approach significantly improving the performance of smaller models, thereby offering a practical and efficient solution for automated order processing.

Keywords

Voice-to-JSON, Whisper, FLAN-T5, Gemma-3, Retrieval-Augmented Generation (RAG), Automatic Speech Recognition (ASR), Fine-tuning, Order Processing.
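
The retrieval and validation steps of the pipeline summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the menu, the JSON schema (`items`, `name`, `quantity`), the bag-of-words cosine retriever, and the function names `retrieve`, `build_prompt` and `parse_order` are hypothetical and are not taken from the paper. The Whisper transcription and the FLAN-T5 or Gemma-3 generation steps are deliberately left out; the sketch only shows how menu context might be selected for a transcript and how a model's JSON output might be checked against the menu.

```python
import json
import re
from collections import Counter
from math import sqrt

# Hypothetical menu; the paper's actual menu and schema are not shown in this section.
MENU = [
    {"name": "margherita pizza"},
    {"name": "pepperoni pizza"},
    {"name": "lemonade"},
    {"name": "garlic bread"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(transcript, menu, k=2):
    """Return up to k menu items that share at least one token with the transcript.

    A simple lexical retriever stands in for whatever retriever a real
    RAG setup would use (an assumption, not the authors' method).
    """
    query = Counter(tokenize(transcript))
    scored = [(cosine(query, Counter(tokenize(item["name"]))), item) for item in menu]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for score, item in scored[:k] if score > 0]


def build_prompt(transcript, context):
    """Prepend the retrieved menu items to the transcript (hypothetical prompt format)."""
    menu_lines = "\n".join("- " + item["name"] for item in context)
    return "Menu context:\n" + menu_lines + "\nOrder: " + transcript + "\nJSON:"


def parse_order(raw, menu):
    """Parse model output; return the order dict, or None if it is malformed
    or names an item that is not on the menu."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("items"), list):
        return None
    names = {item["name"] for item in menu}
    for entry in data["items"]:
        if not isinstance(entry, dict):
            return None
        name, qty = entry.get("name"), entry.get("quantity")
        if not isinstance(name, str) or name not in names:
            return None
        if isinstance(qty, bool) or not isinstance(qty, int) or qty < 1:
            return None
    return data


transcript = "two large pepperoni pizzas and one lemonade"
context = retrieve(transcript, MENU)
prompt = build_prompt(transcript, context)
```

In a full pipeline, `transcript` would come from the ASR model and `prompt` would be passed to the fine-tuned language model, whose raw output would then go through `parse_order`. Rejecting items that are not on the menu is one simple way to catch hallucinated or misheard item names before an order is accepted.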