Volume 12, Number 3

A Study on the Appropriate Size of the Mongolian General Corpus

  Authors

Choi Sun Soo (University of the Humanities, Mongolia) and Ganbat Tsend (Otgontenger University, Mongolia)

  Abstract

This study aims to determine the appropriate size of a Mongolian general corpus using Heaps’ function and the type-token ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, economy, society, culture, sports, and world affairs; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated Heaps’ function from this sample corpus. Next, using the estimated function, we observed how the number of types and the TTR changed as the number of tokens increased in increments of one million. We found that the TTR hardly changed once the number of tokens exceeded 39 to 42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
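The procedure described above can be illustrated with a minimal sketch. It assumes the usual form of Heaps’ function, V(N) = K·N^β (V types, N tokens), under which the TTR is V(N)/N = K·N^(β−1). The abstract does not say how the function was estimated, so the log-log least-squares fit below is an assumption, as are all function names and the `tol` stopping threshold; none of them come from the study, and no values of K or β are implied.

```python
import math

def type_token_curve(tokens, step):
    """Return (n_tokens, n_types) pairs measured every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

def fit_heaps(points):
    """Fit log V = log K + beta * log N by ordinary least squares.

    Assumption: the study's own estimation method may differ.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    k = math.exp(my - beta * mx)
    return k, beta

def predicted_types(k, beta, n):
    """Number of types predicted by Heaps' function at n tokens."""
    return k * n ** beta

def predicted_ttr(k, beta, n):
    """TTR implied by Heaps' function: V(n) / n."""
    return k * n ** (beta - 1)

def first_stable_size(k, beta, step, tol, max_n):
    """Smallest multiple of `step` at which adding `step` more tokens
    lowers the predicted TTR by less than `tol` (hypothetical criterion)."""
    for n in range(step, max_n + 1, step):
        if predicted_ttr(k, beta, n) - predicted_ttr(k, beta, n + step) < tol:
            return n
    return None
```

With a tokenized sample corpus in hand, one would call `fit_heaps(type_token_curve(tokens, step))` and then scan sizes with `first_stable_size(k, beta, 1_000_000, tol, max_n)`. The result depends directly on the chosen `tol`, which is why the study reports a range rather than a single value.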

  Keywords

Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps’ function, TTR, Type, Token.