Umar Jamil, University of Leeds, UK
Lip-reading, the process of deciphering text from visual mouth movements, has garnered significant research attention. While numerous datasets exist for training lip-reading models, their coverage of diverse languages remains limited. In this paper, we introduce a novel pipeline for constructing datasets tailored to lip-reading models, leveraging web-based videos. Notably, this pipeline is the first of its kind to be made publicly available. By employing this pipeline, we compiled a dataset of Italian videos, a previously unexplored language for lip-reading research. We then used this dataset to train two lip-reading models, highlighting the strengths and weaknesses of employing wild-sourced videos (e.g., from YouTube) for lip-reading model training. The proposed pipeline encompasses modules for audio-video synchronization, audio transcription, alignment, and cleaning, and it facilitates the creation of extensive training data with minimal supervision. By presenting this pipeline, we aim to encourage further advancements in lip-reading research, particularly in the domain of multilingual datasets, thus fostering more comprehensive and inclusive lip-reading models.
Deep Learning, Lip Reading, Visual Speech Recognition, Datasets