Academy & Industry Research Collaboration Center (AIRCC)

Volume 11, Number 07, May 2021

A New Hashing based Nearest Neighbors Selection Technique for Big Datasets

  Authors

Jude Tchaye-Kondi, Yanlong Zhai and Liehuang Zhu, Beijing Institute of Technology, China

  Abstract

KNN has the reputation of being a simple and powerful supervised learning algorithm used for either classification or regression. Although KNN prediction performance highly depends on the size of the training dataset, when this one is large, KNN suffers from slow decision making. This is because each decision-making process requires the KNN algorithm to look for nearest neighbors within the entire dataset. To overcome this slowness problem, we propose a new technique that enables the selection of nearest neighbors directly in the neighborhood of a given data point. The proposed approach consists of dividing the data space into sub-cells of a virtual grid built on top of the dataset. The mapping between data points and sub-cells is achieved using hashing. When it comes to selecting the nearest neighbors of a new observation, we first identify the central cell where the observation is contained. Once that central cell is known, we then start looking for the nearest neighbors from it and the cells around. From our experimental performance analysis of publicly available datasets, our algorithm outperforms the original KNN with a predictive quality as good and offers competitive performance with solutions such as KDtree.

  Keywords

Machine learning, Nearest neighbors, Hashing, Big data.