While the performance of a data-parallel structure is primarily determined by the amount of data to be processed, the data are processed by different machines in parallel, so the completion time of an application job is determined by the last machine to finish. Distributing the data load evenly across machines is therefore a grand challenge. In the default configuration of Hadoop MapReduce, this is done with a hash function in the shuffle phase, so that the intermediate results output by the map phase are distributed evenly across the machines in the reduce phase. This default configuration targets the "common" case: it is clearly not optimal for all scenarios, and in many situations it performs poorly.
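As a minimal sketch of this default scheme (key names are illustrative, and Hadoop's actual HashPartitioner is written in Java), the shuffle phase can be thought of as routing each intermediate key to the reducer indexed by its hash modulo the number of reducers:

```python
# A minimal sketch of default hash partitioning in the shuffle phase,
# in the spirit of Hadoop's hash(key) mod R scheme. Keys and values
# below are illustrative, not from any real workload.
def default_partition(key, num_reducers):
    """Map an intermediate key to a reducer index (hash modulo count)."""
    return hash(key) % num_reducers

# Intermediate (key, value) pairs emitted by the map phase.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

# Group pairs by the reducer that will receive them.
buckets = {}
for key, value in pairs:
    buckets.setdefault(default_partition(key, 3), []).append((key, value))
```

Note that all occurrences of the same key land in the same bucket, which is what allows a reducer to aggregate them, but it also means a single hot key can never be split across reducers.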
There are many research studies on this problem, which fall broadly into reactive and proactive approaches. In the reactive approach, the workloads of different machines are monitored, and workload can be migrated from one machine to another when a large skew arises; SkewTune is a good example of the reactive approach. In the proactive approach, the system tries to estimate the workload that the data to be processed will generate on each machine, so that the data can be dispatched to the machines in a load-aware manner.
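A proactive, load-aware dispatch can be sketched as follows. This is a hedged illustration, not any particular system's algorithm: assuming per-key load estimates are available (e.g., from sampling), a classic greedy heuristic assigns each key, heaviest first, to the currently least-loaded machine:

```python
# A sketch of proactive, load-aware dispatch using the greedy
# longest-processing-time (LPT) heuristic. The per-key load estimates
# and all names here are illustrative assumptions.
import heapq

def load_aware_assign(key_loads, num_machines):
    """Assign keys to machines so the maximum machine load stays small.

    key_loads: dict mapping key -> estimated workload.
    Returns a dict mapping key -> machine index.
    """
    # Min-heap of (current_load, machine_id); the least-loaded
    # machine is always at the top.
    heap = [(0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = {}
    # Place heavy keys first so small keys can fill the gaps.
    for key, load in sorted(key_loads.items(), key=lambda kv: -kv[1]):
        current, machine = heapq.heappop(heap)
        assignment[key] = machine
        heapq.heappush(heap, (current + load, machine))
    return assignment

loads = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
plan = load_aware_assign(loads, 2)
```

With these example loads, the two machines end up with 60 units each, whereas an unlucky hash split could put the 50-unit key together with most of the rest on one machine.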
In this chapter, we study a proactive approach that uses sublinear algorithms to improve the performance of processing large data in MapReduce.
The Data Skew Problem of MapReduce Jobs
When running MapReduce jobs, most studies have assumed that the input data are uniformly distributed; when such data are hashed to the reduce worker nodes, they naturally lead to a desirable balanced load in the later stages. Real-world data, however, are not necessarily uniform, and often exhibit a remarkable degree of skewness. For example, in PageRank, the graph commonly includes nodes with much higher degrees of incoming edges than others, and in Inverted Index, certain content can appear in many more documents than others. Such a skewed distribution of the input or intermediate data can result in a small number of mappers or reducers taking significantly longer to complete than the others. Recent experimental studies [21, 19, 22] have shown that, in the CloudBurst application with a biology data set of a bimodal distribution, the slowest map task takes five times as long to complete as the fastest.
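The effect of such skew can be illustrated with a small, made-up example (not CloudBurst itself): even a hash function that spreads *keys* evenly cannot spread *records* evenly when one key dominates, as a high-in-degree PageRank node would.

```python
# Illustration of data skew under hash partitioning. The keys, counts,
# and the stand-in hash below are invented for this sketch; Python's
# built-in hash() is salted per process, so we use a deterministic one.
def stable_partition(key, num_reducers):
    # Deterministic stand-in for a hash function.
    return sum(key.encode()) % num_reducers

# One "hot" key dominates the intermediate data.
key_counts = {"hot": 1000, "k1": 10, "k2": 10, "k3": 10, "k4": 10}

reducer_load = [0, 0, 0]
for key, count in key_counts.items():
    reducer_load[stable_partition(key, 3)] += count
```

Here the reducer that receives the hot key processes roughly 1000 records while the others process a few dozen at most, so the job finishes only when that overloaded reducer does.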