Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 22, December 2022

Reduce++: Unsupervised Content-Based Approach for Duplicate Result Detection in Search Engines

  Authors

Zahraa Chreim (1), Hussein Hazimeh (1, 2), Hassan Harb (1), Fouad Hannoun (2), Karl Daher (2), Elena Mugellini (2) and Omar Abou Khaled (2); (1) Lebanese University, Lebanon; (2) University of Applied Sciences of Western Switzerland, Switzerland

  Abstract

Search engines are among the most popular services on the World Wide Web. They help users find information through a query-result mechanism. However, the results they return often contain many duplicates. In this paper, we introduce a new content-type-based similarity computation method to address this problem. Our approach divides each web page into different types of content, such as title, subtitles, and body. We then select a suitable similarity measure for each type and combine the resulting per-type scores with a weighted formula to obtain the final similarity score between two documents. Finally, we propose a new graph-based algorithm that clusters search results according to their similarity. We empirically evaluated our results with Agglomerative Clustering and achieved a reduction of about 61% in web pages, a Silhouette coefficient of 0.2757, a Davies-Bouldin score of 0.1269, and a Calinski-Harabasz score of 85.
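
The pipeline described above (per-type similarity, weighted combination, graph-based grouping) can be illustrated with a minimal sketch. This is not the authors' implementation: the weights, the use of token-set Jaccard similarity for every content type, the 0.8 threshold, and the use of connected components as the grouping step are all assumptions chosen for illustration, and the names WEIGHTS, jaccard, page_similarity, and cluster are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights per content type (assumption; not taken from the paper).
WEIGHTS = {"title": 0.3, "subtitles": 0.2, "body": 0.5}


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a stand-in measure for every content type."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        # Illustrative choice: two empty fields count as identical.
        return 1.0
    return len(ta & tb) / len(ta | tb)


def page_similarity(p: dict, q: dict) -> float:
    """Weighted sum of the per-type similarity scores of two pages."""
    return sum(w * jaccard(p.get(k, ""), q.get(k, "")) for k, w in WEIGHTS.items())


def cluster(pages: list, threshold: float = 0.8) -> list:
    """Link pages whose similarity meets the threshold, then return the
    connected components of that graph as groups of page indices."""
    n = len(pages)
    adj = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if page_similarity(pages[i], pages[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(comp))
    return groups
```

Under these assumptions, keeping one representative page per group yields the deduplicated result list; the reduction rate then depends directly on the chosen threshold and weights.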

  Keywords

Information Retrieval, Website Similarity, Graph Representation, Similarity Measures, Graph Kernel, Deduplication, Search Engines.