Reduce++: Unsupervised Content-Based Approach for Duplicate Result Detection in Search Engines

doi:10.5121/csit.2022.122211

Volume 12, Number 22, December 2022

Authors

Zahraa Chreim¹, Hussein Hazimeh¹,², Hassan Harb¹, Fouad Hannoun², Karl Daher², Elena Mugellini² and Omar Abou Khaled²; ¹Lebanese University, Lebanon; ²University of Applied Sciences of Western Switzerland, Switzerland

Abstract

Search engines are among the most popular web services on the World Wide Web. They facilitate finding information through a query-result mechanism. However, the results they return often contain many duplicates. In this paper, we introduce a new content-type-based similarity computation method to address this problem. Our approach divides a webpage into different types of content, such as title, subtitles, and body. We then select a suitable similarity measure for each type and combine the resulting scores, using a weighted formula, into a final similarity score between two documents. Finally, we propose a new graph-based algorithm that clusters search results according to their similarity. We empirically evaluated our results with Agglomerative Clustering and achieved about a 61% reduction in web pages, a Silhouette coefficient of 0.2757, a Davies-Bouldin score of 0.1269, and a Calinski-Harabasz score of 85.
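The pipeline described in the abstract (per-content-type similarity, weighted combination, graph-based grouping) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the weights in `WEIGHTS`, the choice of a character-level ratio for titles and word-level Jaccard for subtitles and body, the `threshold` value, and the use of connected components over a thresholded similarity graph as the grouping step. The paper's own measures, weights, and clustering algorithm may differ.

```python
from difflib import SequenceMatcher

# Hypothetical weights per content type (not the paper's values).
WEIGHTS = {"title": 0.3, "subtitles": 0.2, "body": 0.5}


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty fields are treated as identical
    return len(sa & sb) / len(sa | sb)


def char_ratio(a, b):
    """Character-level similarity ratio from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Assumed mapping from content type to similarity measure.
MEASURES = {"title": char_ratio, "subtitles": jaccard, "body": jaccard}


def page_similarity(p, q):
    """Weighted average of the per-content-type similarity scores."""
    total = sum(WEIGHTS.values())
    score = sum(
        w * MEASURES[k](p.get(k, ""), q.get(k, ""))
        for k, w in WEIGHTS.items()
    )
    return score / total


def cluster(pages, threshold=0.8):
    """Group pages via connected components of the graph whose edges
    link pairs with similarity >= threshold (union-find)."""
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if page_similarity(pages[i], pages[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Under these assumptions, keeping one representative page per returned group would yield the deduplicated result list; the pairwise loop is quadratic in the number of results, which is acceptable for a single results page but not for large collections.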

Keywords

Information Retrieval, Websites Similarity, Graph Representation, Similarity Measures, Graph Kernel, Deduplication, Search Engines.
