News Article Text Classification and Summary for Authors and Topics

Aviel J. Stein; Janith Weerasinghe; Spiros Mancoridis; Rachel Greenstadt

doi:10.5121/csit.2020.101401

Volume 10, Number 14, November 2020

Authors

Aviel J. Stein¹, Janith Weerasinghe², Spiros Mancoridis¹ and Rachel Greenstadt², ¹Drexel University, USA, ²New York University, USA

Abstract

News articles are important for providing timely, historic information. However, the Internet is replete with text that may contain irrelevant or unhelpful information, so means of processing it and distilling its content are important and useful both to human readers and to information extraction tools. Some common questions we may want to answer are “what is this article about?” and “who wrote it?”. In this work we compare machine learning models on two common NLP tasks, topic classification and authorship attribution, using the 2017 Vox Media dataset. Additionally, we use the models to classify on a subsection, about 20%, of the original text, which proves to be better for classification than the provided blurbs. Because of the large number of topics, we take topic overlap into account and address it via top-n accuracy and hierarchical groupings of topics. We also consider edge cases in authorship by classifying on inter-topic and intra-topic author distributions. Our results show that both topics and authors are readily identifiable, and that neural networks consistently perform best compared with support vector machines, random forests, or naive Bayes classifiers, although the latter methods perform acceptably.
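The top-n accuracy mentioned above counts a prediction as correct when the true topic appears among the n highest-scoring classes, which is more forgiving when topics overlap. The following is a minimal sketch of that metric, not the authors' implementation; the function name top_n_accuracy, the toy scores, and the four hypothetical topic indices are all assumptions made for illustration.

```python
from typing import Sequence


def top_n_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    n: int = 3,
) -> float:
    """Fraction of examples whose true label is among the n highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    hits = 0
    for row, true_label in zip(scores, labels):
        # Rank class indices from highest to lowest score and keep the first n.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Toy classifier scores for three articles over four hypothetical topics.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # true topic 1: ranked first
    [0.40, 0.35, 0.15, 0.10],  # true topic 1: ranked second
    [0.05, 0.15, 0.30, 0.50],  # true topic 0: ranked last
]
toy_labels = [1, 1, 0]

top1 = top_n_accuracy(toy_scores, toy_labels, n=1)
top2 = top_n_accuracy(toy_scores, toy_labels, n=2)
print(f"top-1 accuracy: {top1:.3f}, top-2 accuracy: {top2:.3f}")
```

In this toy example the second article is missed at n=1 but counted at n=2, which is the kind of near-miss between overlapping topics that top-n accuracy is meant to credit.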

Keywords

Natural Language Processing, Topic Classification, Author Attribution, Summarization, Machine Learning.
