Volume 13, Number 5/6
Data-Driven Part-of-Speech Tagging for the Gikuyu Language: Development, Challenges, and Prospects
Authors
Gabriel Kamau, Dedan Kimathi University of Technology, Kenya
Abstract
This paper presents the development of a data-driven Part-of-Speech (POS) tagger for Gikuyu, a Bantu language spoken in Kenya. Like many indigenous African languages, Gikuyu is under-resourced, with few computational tools available for linguistic processing. Using a corpus drawn primarily from the Gikuyu Bible and a Memory-Based Tagging (MBT) approach, the study demonstrates that building a POS tagger for the language is feasible. The tagger achieved a precision of 90.44%, a recall of 88.34%, and an F-score of 91.35%. These results indicate its potential for applications in machine translation, speech recognition, and language preservation. The study also discusses the challenges of working with under-resourced languages, notably data collection and annotation, and recommends directions for future work, including integration with broader NLP tasks.
Keywords
Natural Language Processing, Part-of-Speech Tagging, Gikuyu Language, Data-Driven Approach, Low-Resource Languages.