Paper 5

Performance Analysis of Adapting a MapReduce Framework to Dynamically Accommodate Heterogeneity

Authors: Jessica Hartog and Renan DelValle and Madhusudhan Govindaraju and Michael J. Lewis

Volume 20 (2015)

Abstract

When data centers employ the common and economical prac- tice of upgrading subsets of nodes incrementally, rather than replacing or upgrading all nodes at once, they end up with clusters whose nodes have non-uniform processing capability, which we also call performance- heterogeneity. Popular frameworks supporting the effective MapReduce programming model for Big Data applications do not flexibly adapt to these environments. Instead, existing MapReduce frameworks, including Hadoop, typically divide data evenly among worker nodes, thereby induc- ing the well-known problem of stragglers on slower nodes. Our alternative MapReduce framework, called MARLA, divides each worker’s labor into sub-tasks, delays the binding of data to worker processes, and thereby enables applications to run faster in performance-heterogeneous environ- ments. This approach does introduce overhead, however. We explore and characterize the opportunity for performance gains, and identify when the benefits outweigh the costs. Our results suggest that frameworks should support finer grained sub-tasking and dynamic data partitioning when running on some performance-heterogeneous clusters. Blindly tak- ing this approach in homogeneous clusters can slow applications down. Our study further suggests the opportunity for cluster managers to build performance-heterogeneous clusters by design, if they also run MapRe- duce frameworks that can exploit them.