A Practical Algorithm for Efficiently Deduplicating Highly Similar News in Large News Corpora

doi:10.5121/csit.2023.131214

Authors

Wu Zhang, Miotech, Hong Kong

Abstract

Duplicated training data usually degrades the performance of machine learning models. This paper presents a practical algorithm for efficiently deduplicating highly similar news articles in large datasets. Our algorithm comprises three components (document embedding, similarity computation, and clustering), each using specific algorithms and tools to optimize both speed and performance. We use the Doc2Vec model to generate document embeddings, Faiss for rapid similarity search, and a disjoint-set data structure for clustering. We demonstrate the efficacy of our approach by accurately deduplicating over 7 million news articles in less than 4 hours.
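The clustering component can be illustrated with a minimal sketch. It assumes the similarity search has already produced pairs of document indices judged to be near-duplicates; the names `DisjointSet` and `cluster_duplicates`, and the input format, are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Locate the root, then point every node on the path directly at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_duplicates(num_docs, similar_pairs):
    """Group document indices into clusters of mutually linked duplicates.

    similar_pairs is assumed to be an iterable of (i, j) index pairs that an
    upstream similarity search flagged as near-duplicates (hypothetical input).
    """
    ds = DisjointSet(num_docs)
    for i, j in similar_pairs:
        ds.union(i, j)
    groups = defaultdict(list)
    for doc_id in range(num_docs):
        groups[ds.find(doc_id)].append(doc_id)
    return sorted(groups.values())


if __name__ == "__main__":
    # Documents 0, 1, 2 are chained together; 4 and 5 form a second cluster.
    print(cluster_duplicates(6, [(0, 1), (1, 2), (4, 5)]))
```

Because the links are transitive, documents 0 and 2 land in the same cluster even though they were never paired directly. Keeping one representative per cluster then yields a deduplicated corpus.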

Keywords

Document Embedding, Text Similarity, News Deduplication, Natural Language Processing

AIRCC
