Volume 11, Number 4

Categorizing 2019-n-CoV Twitter Hashtag Data by Clustering


Koffka Khan¹ and Emilie Ramsahai², ¹The University of the West Indies, Trinidad, ²UWI School of Business & Applied Studies Ltd (UWI-ROYTEC), Trinidad and Tobago


Unsupervised machine learning techniques such as clustering are gaining wide use with the recent growth of social communication platforms like Twitter and Facebook. Clustering helps reveal patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset and compared the performance of nine clustering algorithms on it. We evaluated the generalizability of these algorithms using a supervised learning model. Finally, using a selected unsupervised learning algorithm, we categorized the clusters. The top five categories are Safety, Crime, Products, Countries and Health. This approach can help organizations that work with large amounts of Twitter data and need to quickly identify key points before moving on to further classification.


Unsupervised machine learning, clustering, Twitter, 2019-nCoV, hashtags, Kaggle, supervised, classification.
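

The workflow summarized in the abstract (cluster hashtag data, then check with a supervised model whether the cluster labels generalize) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy tweets are hypothetical and not drawn from the Kaggle dataset, plain k-means with two clusters stands in for the nine algorithms compared, and a nearest-centroid classifier stands in for the supervised model. It is not the authors' implementation.

```python
import re

# Hypothetical toy tweets; not taken from the Kaggle dataset used in the paper.
TWEETS = [
    "Wear a mask and wash hands #safety #covid19",
    "Stay home, stay safe #safety #lockdown",
    "Keep your distance #safety #socialdistancing",
    "Wash hands often #safety #hygiene",
    "Hospital beds filling up #health #covid19",
    "New symptoms reported #health #coronavirus",
    "Vaccine trial update #health #vaccine",
    "Doctors urge testing #health #testing",
]


def hashtag_vectors(tweets):
    """Encode each tweet as a binary vector over the hashtag vocabulary."""
    tags = [set(re.findall(r"#(\w+)", t.lower())) for t in tweets]
    vocab = sorted(set().union(*tags))
    return [[1.0 if v in ts else 0.0 for v in vocab] for ts in tags]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def two_means(points, iterations=10):
    """Plain k-means with k=2 and a deterministic farthest-point start."""
    far = max(range(len(points)), key=lambda i: sq_dist(points[0], points[i]))
    centroids = [points[0], points[far]]
    labels = []
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for k in range(2):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean(members)
    return labels


def holdout_accuracy(points, labels):
    """Fit a nearest-centroid classifier on even-indexed points, using the
    cluster labels as targets, and score it on the odd-indexed points."""
    train = range(0, len(points), 2)
    test = range(1, len(points), 2)
    classes = sorted({labels[i] for i in train})
    class_centroids = [
        mean([points[i] for i in train if labels[i] == c]) for c in classes
    ]
    hits = sum(
        classes[nearest(points[i], class_centroids)] == labels[i] for i in test
    )
    return hits / len(test)


vectors = hashtag_vectors(TWEETS)
labels = two_means(vectors)
accuracy = holdout_accuracy(vectors, labels)
print("cluster labels:", labels)
print("held-out agreement with cluster labels:", accuracy)
```

A high held-out agreement suggests that the cluster assignments capture structure a classifier can recover on unseen tweets, which is one simple reading of "generalizability"; the paper's own evaluation procedure may differ.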