
5 - Big data analytics systems

from Part II - Big data over cyber networks

Summary

Performing timely analysis on huge datasets is the central promise of big data analytics. To cope with the high volumes of data to be analyzed, computation frameworks have resorted to "scaling out": parallelizing analytics so that they execute seamlessly across large clusters. These frameworks automatically compose analytics jobs into a directed acyclic graph (DAG) of small tasks, and then aggregate the intermediate results of those tasks to obtain the final result. Their ability to do so relies on an efficient scheduler and on a reliable storage layer that distributes the datasets across machines.
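To make the decomposition concrete, here is a minimal sketch in Python of a job expressed as a DAG of tasks. The task names, the toy dependency-driven scheduler, and the aggregation step are all hypothetical illustrations, not the API of any deployed framework.

# Toy DAG-structured job: each task maps to (function, upstream dependencies).
# Everything here is illustrative; a real framework distributes these tasks
# across a cluster and persists intermediate results in a storage layer.
dag = {
    "split_a": (lambda inputs: [1, 2, 3], []),
    "split_b": (lambda inputs: [4, 5, 6], []),
    "sum_a":   (lambda inputs: sum(inputs["split_a"]), ["split_a"]),
    "sum_b":   (lambda inputs: sum(inputs["split_b"]), ["split_b"]),
    "total":   (lambda inputs: inputs["sum_a"] + inputs["sum_b"],
                ["sum_a", "sum_b"]),
}

def run(dag):
    """Repeatedly run every task whose dependencies have produced results."""
    results, pending = {}, dict(dag)
    while pending:
        ready = [t for t, (_, deps) in pending.items()
                 if all(d in results for d in deps)]
        for t in ready:
            fn, deps = pending.pop(t)
            results[t] = fn({d: results[d] for d in deps})
    return results

print(run(dag)["total"])  # aggregated final result: 21

A production scheduler performs the same bookkeeping at much larger scale, while also deciding which machine runs each task, for example placing a task near a replica of its input data.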

In this chapter, we survey these two aspects, scheduling and storage, which are the foundations of modern big data analytics systems. We describe their key principles, and how those principles are realized in widely deployed systems.

Introduction

Analyzing large volumes of data has become the major source of innovation behind large Internet services as well as scientific applications. Examples of such "big data analytics" occur in personalized recommendation systems, online social networks, genomic analyses, and legal investigations for fraud detection. A key property of the algorithms employed for such analyses is that they produce better results as the amount of data processed grows. In fact, in certain domains (like search) there is a trend towards using relatively simple algorithms and instead relying on more data to produce better results.

While the amount of data to be analyzed keeps growing, the acceptable time to produce results keeps shrinking. Timely analyses have significant ramifications for revenue as well as productivity. Low-latency results in online services lead to improved user satisfaction and revenue, and the ability to crunch large datasets in short periods leads to faster iteration and faster progress in scientific work.

To cope with this dichotomy of ever-growing datasets and shrinking times to analyze them, analytics clusters have resorted to scaling out: data are spread across many machines, and the computations on them are executed in parallel. Such scaling out is crucial for fast analytics, and it allows coping with datasets that grow faster than Moore's-law increases in processor speed.
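As a rough sketch of the idea, the input can be partitioned, each partition processed independently, and the partial results combined. Here a local process pool stands in for a cluster of machines, an assumption made purely for illustration.

# Sketch of scale-out execution: the input is split into partitions, each
# partition is processed in parallel (worker processes stand in for machines
# in a cluster), and the partial results are combined at the end.
from multiprocessing import Pool

def partial_sum(partition):
    # Runs independently on its partition, like a task on one machine.
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    num_partitions = 8
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with Pool(processes=num_partitions) as pool:
        partials = pool.map(partial_sum, partitions)  # parallel phase
    print(sum(partials))  # combine the intermediate results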

Many data analytics frameworks have been built for such large-scale parallel execution. Among the most widely used are MapReduce [1], Dryad [2], and Apache YARN [3].
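To illustrate the programming model these frameworks popularized, here is a toy, single-process word count in the style of MapReduce [1]. The explicit sort-and-group step below is a stand-in for the distributed shuffle a real framework performs, not Hadoop's actual API.

# Word count in the MapReduce style [1]. The user writes map and reduce
# functions; the framework is responsible for partitioning the input,
# shuffling intermediate pairs by key, and scheduling the tasks.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    for word in line.split():
        yield word, 1          # emit an intermediate (key, value) pair

def reduce_fn(word, counts):
    return word, sum(counts)   # aggregate all values for one key

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_fn(line)]   # map phase
pairs.sort(key=itemgetter(0))                           # "shuffle": group by key
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_fn(word, (count for _, count in group)))  # reduce phase

The appeal of the model is that map_fn and reduce_fn are oblivious to parallelism: the framework can run many map tasks and reduce tasks concurrently on different machines without changing the user's code.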

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in ACM EuroSys, 2007.
[3] V. Vavilapalli et al., "Apache Hadoop YARN: yet another resource negotiator," in ACM SoCC, 2013.
[4] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: a platform for fine-grained resource sharing in the data center," in USENIX NSDI, 2011.
[5] M. Shreedhar and G. Varghese, "Efficient fair queuing using deficit round-robin," IEEE/ACM Transactions on Networking, vol. 4, no. 3, pp. 375–385, 1996.
[6] A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queueing algorithm," ACM SIGCOMM Computer Communication Review, vol. 19, no. 4, pp. 1–12, 1989.
[7] A. Ghodsi, M. Zaharia, B. Hindman, et al., "Dominant resource fairness: fair allocation of multiple resource types," in NSDI, vol. 11, 2011, pp. 24–24.
[8] M. Zaharia, D. Borthakur, J. S. Sarma, et al., "Job scheduling for multi-user MapReduce clusters," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, 2009.
[9] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, "Multi-resource packing for cluster schedulers," in ACM SIGCOMM, 2014, pp. 455–466. [Online]. Available: http://doi.acm.org/10.1145/2619239.2626334.
[10] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Choosy: max-min fair sharing for datacenter jobs with constraints," in Proceedings of the 8th ACM European Conference on Computer Systems, ACM, 2013, pp. 365–378.
[11] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das, "Modeling and synthesizing task placement constraints in Google compute clusters," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, 2011, p. 3.
[12] M. Zaharia, D. Borthakur, J. Sen Sarma, et al., "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems, ACM, 2010, pp. 265–278.
[13] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: fair scheduling for distributed computing clusters," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ACM, 2009, pp. 261–276.
[14] P. Bodík, I. Menache, M. Chowdhury, et al., "Surviving failures in bandwidth-constrained datacenters," in Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ACM, 2012, pp. 431–442.
[15] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca, "Jockey: guaranteed job latency in data parallel clusters," in Proceedings of the 7th ACM European Conference on Computer Systems, ACM, 2012, pp. 99–112.
[16] C. Curino, D. E. Difallah, C. Douglas, et al., "Reservation-based scheduling: if you're late don't blame us!" in Proceedings of the ACM Symposium on Cloud Computing, ACM, 2014, pp. 1–14.
[17] N. Jain, I. Menache, J. Naor, and J. Yaniv, "Near-optimal scheduling mechanisms for deadline-sensitive jobs in large computing clusters," in SPAA, 2012, pp. 255–266.
[18] B. Lucier, I. Menache, J. Naor, and J. Yaniv, "Efficient online scheduling for deadline-sensitive jobs: extended abstract," in SPAA, 2013, pp. 305–314.
[19] P. Bodík, I. Menache, J. S. Naor, and J. Yaniv, "Brief announcement: deadline-aware scheduling of big-data processing jobs," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, ACM, 2014, pp. 211–213.
[20] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in OSDI, vol. 8, no. 4, 2008, p. 7.
[21] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in map-reduce clusters using Mantri," in OSDI, vol. 10, no. 1, 2010, p. 24.
[22] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: attack of the clones," in NSDI, vol. 13, 2013, pp. 185–198.
[23] "POSIX," http://pubs.opengroup.org/onlinepubs/9699919799/.
[24] G. Ananthanarayanan, A. Ghodsi, A. Wang, et al., "PACMan: coordinated memory caching for parallel jobs," in USENIX NSDI, 2012.
[25] G. Ananthanarayanan, S. Agarwal, S. Kandula, et al., "Scarlett: coping with skewed content popularity in MapReduce clusters," in ACM EuroSys, 2011.
[26] M. Zaharia, M. Chowdhury, T. Das, et al., "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," in USENIX NSDI, 2012.
[27] S. Melnik, A. Gubarev, J. J. Long, et al., "Dremel: interactive analysis of web-scale datasets," in Proceedings of the 36th International Conference on Very Large Data Bases, 2010, pp. 330–339.
[28] R. Xin, J. Rosen, M. Zaharia, et al., "Shark: SQL and rich analytics at scale," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.
[29] S. Agarwal, B. Mozafari, A. Panda, et al., "BlinkDB: queries with bounded errors and bounded response times on very large data," in Proceedings of the 8th European Conference on Computer Systems, ACM, 2013.
[30] J. Liu, W.-K. Shih, K.-J. Lin, R. Bettati, and J.-Y. Chung, "Imprecise computations," Proceedings of the IEEE, 1994.
[31] S. Lohr, Sampling: Design and Analysis. Thomson, 2009.
[32] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, "Energy efficiency for large-scale MapReduce workloads with significant interactive analysis," in Proceedings of the 7th ACM European Conference on Computer Systems, ACM, 2012, pp. 43–56.
[33] Z. Liu, Y. Chen, C. Bash, et al., "Renewable and cooling aware workload management for sustainable data centers," in ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 1, ACM, 2012, pp. 175–186.
[34] A. Beloglazov, R. Buyya, Y. C. Lee, et al., "A taxonomy and survey of energy-efficient data centers and cloud computing systems," Advances in Computers, vol. 82, no. 2, pp. 47–111, 2011.
[35] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, "Optimal power allocation in server farms," in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1, ACM, 2009, pp. 157–168.
[36] A. Gandhi, V. Gupta, M. Harchol-Balter, and M. A. Kozuch, "Optimality analysis of energy-performance trade-off for server farm management," Performance Evaluation, vol. 67, no. 11, pp. 1155–1171, 2010.
[37] N. Buchbinder, N. Jain, and I. Menache, "Online job-migration for reducing the electricity bill in the cloud," in NETWORKING 2011, Springer, 2011, pp. 172–185.
[38] "EC2 pricing," http://aws.amazon.com/ec2/pricing/.
[39] I. Menache, O. Shamir, and N. Jain, "On-demand, spot, or both: dynamic resource allocation for executing batch jobs in the cloud," in 11th International Conference on Autonomic Computing (ICAC), 2014.