Performing timely analysis on huge datasets is the central promise of big data analytics. To cope with the high volumes of data to be analyzed, computation frameworks have resorted to “scaling out” – parallelization of analytics that allows for seamless execution across large clusters. These frameworks automatically compose analytics jobs into a DAG of small tasks, and then aggregate the intermediate results from the tasks to obtain the final result. Their ability to do so relies on an efficient scheduler and a reliable storage layer that distributes the datasets on different machines.
In this chapter, we survey the above two aspects, scheduling and storage, which are the foundations of modern big data analytics systems.We describe their key principles, and how these principles are realized in widely deployed systems.
Analyzing large volumes of data has become the major source for innovation behind large Internet services as well as scientific applications. Examples of such “big data analytics” occur in personalized recommendation systems, online social networks, genomic analyses, and legal investigations for fraud detection. A key property of the algorithms employed for such analyses is that they provide better results with increasing amount of data processed. In fact, in certain domains (like search) there is a trend towards using relatively simpler algorithms and instead relying on more data to produce better results.
While the amount of data to be analyzed increases on the one hand, the acceptable time to produce results is shrinking on the other hand. Timely analyses have significant ramifications for revenue as well as productivity. Low latency results in online services leads to improved user satisfaction and revenue. Ability to crunch large datasets in short periods results in faster iterations and progress in scientific theories.
To cope with the dichotomy of ever-growing datasets and shrinking times to analyze them, analytics clusters have resorted to scaling out. Data are spread across many different machines, and the computations on them are executed in parallel. Such scaling out is crucial for fast analytics and allows coping with the trend of datasets growing faster than Moore's laws increase in processor speeds.
Many data analytics frameworks have been built for large scale-out parallel executions. Some of the widely used frameworks are MapReduce , Dryad  and Apache Yarn .