Volume 12, Number 4

Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Implemented
on Different Hadoop Platforms


Purvi Parmar, MaryEtta Morris, John R. Talburt and Huzaifa F. Syed, University of Arkansas, USA


This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm for Apache MapReduce on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for processing and storing large amounts of data in a distributed computing environment. The research presented here focuses on the variations observed among the results of an efficient iterative transitive closure algorithm when run in different distributed environments. The results from these comparisons were validated against benchmark results from OYSTER, an open source Entity Resolution system. The experimental results highlight the inconsistencies that can occur when the same codebase is used with different implementations of MapReduce.


Entity Resolution, Hadoop, MapReduce, Transitive Closure, HDFS, Cloudera, Talend.