Volume 12, Number 4

Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Implemented
on Different Hadoop Platforms

  Authors

Purvi Parmar, MaryEtta Morris, John R. Talburt and Huzaifa F. Syed, University of Arkansas, USA

  Abstract

This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm in Apache MapReduce on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for storing and processing large amounts of data in a distributed computing environment. The research presented here focuses on the variations observed in the results of an efficient iterative transitive closure algorithm when it is run in different distributed environments. The results from these comparisons were validated against benchmark results from OYSTER, an open source Entity Resolution system. The experimental results highlight the inconsistencies that can occur when the same codebase is used with different implementations of MapReduce.
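
For readers unfamiliar with transitive closure in entity resolution, the following is a minimal single-process sketch of the general idea: records linked directly or indirectly by match pairs end up in the same cluster. It is not the algorithm evaluated in the paper, and it does not use Hadoop. The function name `transitive_closure`, the "map" and "reduce" phase labels, and the sample record IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def transitive_closure(pairs):
    """Group record IDs that are linked directly or indirectly by `pairs`.

    Illustrative only: each loop iteration mimics one map/reduce round,
    but everything runs in a single process.
    """
    # Every record starts in its own cluster, labeled with its own ID.
    label = {}
    for a, b in pairs:
        label.setdefault(a, a)
        label.setdefault(b, b)

    changed = True
    while changed:
        changed = False
        # "Map" phase (hypothetical naming): each pair proposes the smaller
        # of its two current labels to both of its records.
        proposals = defaultdict(list)
        for a, b in pairs:
            low = min(label[a], label[b])
            proposals[a].append(low)
            proposals[b].append(low)
        # "Reduce" phase: each record keeps the smallest label proposed to it.
        for record, candidates in proposals.items():
            best = min(candidates)
            if best < label[record]:
                label[record] = best
                changed = True

    # Collect the records that share a final label into clusters.
    clusters = defaultdict(set)
    for record, lab in label.items():
        clusters[lab].add(record)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical match pairs: A-B and B-C imply A-C.
    print(transitive_closure([("A", "B"), ("B", "C"), ("D", "E")]))
```

Because the loop stops only when no label changes, the final clusters do not depend on the order of the input pairs. A deterministic reference of this kind is one way to check the output of a distributed run, in the same spirit as the benchmark comparison described above.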

  Keywords

Entity Resolution, Hadoop, MapReduce, Transitive Closure, HDFS, Cloudera, Talend.