Wu Zhang, Miotech, Hong Kong
Duplicated training data usually degrades the performance of machine learning models. This paper presents a practical algorithm for efficiently deduplicating highly similar news articles in large datasets. Our algorithm comprises three components - document embedding, similarity computation, and clustering - each built on specific algorithms and tools chosen to balance speed and accuracy. We use the Doc2Vec model to generate document embeddings, employ Faiss for rapid similarity search, and perform clustering with a disjoint-set data structure. We demonstrate the efficacy of our approach by accurately deduplicating over 7 million news articles in under 4 hours.
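To make the three-stage pipeline concrete, the following is a minimal sketch using gensim's Doc2Vec, Faiss, and a hand-rolled disjoint set. It is an illustration of the technique, not the paper's implementation: the vector size, epoch count, neighbour count, and the 0.95 cosine-similarity threshold are assumed values for demonstration only.

```python
# Minimal sketch of the embed -> search -> cluster pipeline.
# Hyperparameters below are illustrative assumptions, not from the paper.
import numpy as np
import faiss
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "stocks rally as markets open higher",
    "markets open higher and stocks rally",
    "central bank holds interest rates steady",
]

# 1) Document embedding with Doc2Vec.
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
vecs = np.vstack([model.infer_vector(t.split()) for t in texts]).astype("float32")

# 2) Similarity search with Faiss: L2-normalised vectors turn the
# inner-product index into a cosine-similarity index.
faiss.normalize_L2(vecs)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
sims, nbrs = index.search(vecs, 3)  # per row: top-3 similarities and ids

# 3) Clustering with a disjoint set: union any pair above the threshold.
parent = list(range(len(texts)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

THRESHOLD = 0.95  # illustrative; tune on known duplicate pairs
for i in range(len(texts)):
    for sim, j in zip(sims[i], nbrs[i]):
        if i != j and sim >= THRESHOLD:
            union(i, j)

# Documents sharing a root are treated as duplicates of one another.
clusters = {}
for i in range(len(texts)):
    clusters.setdefault(find(i), []).append(i)
print(clusters)
```

Normalising the embeddings before indexing is what lets the flat inner-product index serve as a cosine-similarity search, and the disjoint set merges near-duplicate pairs transitively into clusters without ever materialising the full pairwise similarity matrix.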
Document Embedding, Text Similarity, News Deduplication, Natural Language Processing