N-Gram-Based Serbian Text Classification

doi:10.5121/csit.2023.131613

Authors

Petar Prvulović¹, Nemanja Radosavljević¹, Dušan Vujošević¹, Dhinaharan Nagamalai², Jelena Vasiljević¹

¹Union University, Serbia; ²Wireilla, Australia

Abstract

Natural language processing is an active area of research with applications in a variety of fields. Low-resource languages are a challenge because they lack curated datasets, stemmers and other resources used in text processing. A statistical approach is an alternative that can be used to bypass the lack of rule-based implementations. This paper presents a model for classifying unstructured text in the Serbian language. The model uses n-gram-based stemming to create document attribute vectors. Vectors are built from 3-, 4- and 5-grams. Vector reduction is tested with two selection criteria, n-gram entropy and number of occurrences, and two vector lengths, 1000 and 2000 n-grams. A support vector machine is used to classify the documents. The model is trained and tested on a dataset collected from a Serbian news portal. A classification accuracy of over 80% is achieved. The presented model provides a good basis for a range of applications in business decision automation for low-resource languages.
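The feature-construction steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: n-grams are taken inside whitespace-separated lowercase words, "n-gram entropy" is read as the Shannon entropy of an n-gram's occurrences across documents, and low-entropy n-grams are kept first. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams taken inside each lowercased word (assumed tokenization)."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def document_counts(docs, sizes=(3, 4, 5)):
    """One Counter of 3-, 4- and 5-gram occurrences per document."""
    return [Counter(g for n in sizes for g in char_ngrams(d, n)) for d in docs]


def ngram_entropy(gram, counts):
    """Shannon entropy (bits) of one n-gram's occurrences across documents.

    Assumed definition; the abstract does not state the formula used.
    """
    freqs = [c[gram] for c in counts if c[gram] > 0]
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs)


def select_vocabulary(counts, size, criterion="occurrences"):
    """Keep `size` n-grams ranked by total occurrences or by entropy."""
    total = Counter()
    for c in counts:
        total.update(c)
    if criterion == "occurrences":
        # Most frequent first, ties broken alphabetically.
        key = lambda g: (-total[g], g)
    else:
        # Assumption: lowest entropy first, ties broken by higher count.
        key = lambda g: (ngram_entropy(g, counts), -total[g], g)
    return sorted(total, key=key)[:size]


def vectorize(counts, vocabulary):
    """Fixed-length count vector for each document."""
    return [[c[g] for g in vocabulary] for c in counts]


docs = ["vlada srbije", "srbija vlada"]
counts = document_counts(docs)
vocabulary = select_vocabulary(counts, size=5)
vectors = vectorize(counts, vocabulary)
```

In the setting described above, `size` would be 1000 or 2000, and the resulting vectors would be passed together with the document labels to a support vector machine (for example scikit-learn's `sklearn.svm.SVC`, which is not shown here).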

Keywords

N-gram stemming, Serbian language, Unstructured natural text categorization

AIRCC