Paper 2

An Uncoupled Data Process and Transfer Model for MapReduce

Authors: Li Zha, Jie Zhang, Wei Liu, and Jian Lin

Volume 17 (2015)

Abstract

In the original MapReduce model, reduce tasks fetch the output data of map tasks in a “pull” manner. However, reduce tasks that occupy reduce slots cannot start executing until all the corresponding map tasks are completed. This creates a dependence between map and reduce tasks, which is called the coupled relationship in this paper. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. Meanwhile, storing the result data is costly, especially when the system uses replication, which leads to the inefficient storage problem. We propose an uncoupled data process and transfer model to address these problems. Four core techniques, namely weighted mapping, data pushing, partial data backup, and data compression, are introduced and applied in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been put into practice at Baidu, the biggest search engine company in China. A real-world application for web data processing shows that our model can improve the system throughput by 29.5%, reduce the total wall time by 22.8%, provide a weighted wall time acceleration of 26.3%, and reduce the result data stored on disk by 70%. What’s more, the implementation of this model is transparent to users and compatible with the original Hadoop.