Skip to main content Accessibility help
×
Hostname: page-component-8448b6f56d-c47g7 Total loading time: 0 Render date: 2024-04-19T12:12:39.105Z Has data issue: false hasContentIssue false

8 - Beyond MapReduce: New Requirements for Scalable Data Processing

Published online by Cambridge University Press:  05 December 2012

Bill Howe
Affiliation:
University of Washington
Magdalena Balazinska
Affiliation:
University of Washington
Ian Gorton
Affiliation:
Pacific Northwest National Laboratory, Washington
Deborah K. Gracio
Affiliation:
Pacific Northwest National Laboratory, Washington
Get access

Summary

Introduction and Background

The MapReduce programming model has had a transformative impact on dataintensive computing, enabling a single programmer to harness hundreds or thousands of computers for a single task and get up and running in a matter of hours. Processing with thousands of computers require a different set of design considerations dominate: I/O scalability, fault tolerance, and flexibility rather than absolute performance. MapReduce, and the open-source implementation Hadoop, are optimized for these considerations and have become very successful as a result.

It is difficult to quantify the popularity of the MapReduce framework directly, but one indication of the uptake is the frequency of the search term. Figure 8.1 illustrates the search popularity for terms “mapreduce” and “hadoop” over the period 2006 to 2012. We see a spike in popularity for the term “mapreduce” in late 2007, but more or less constant popularity since. For the term “hadoop,” however, we see a steady increase to about twelve times that of “mapreduce.”

These data suggest that MapReduce and Hadoop are generating interest, as seen from the number of downloads, successful startups [12, 19, 47], projects [41, 53, 57], and interest from the research community [15, 18, 62, 63, 72, 78]. These data suggest a significant increase in interest in both MapReduce and Hadoop.

The MapReduce framework provides a simple programming model for expressing loosely coupled parallel programs by providing two serial functions, Map and Reduce. The Map function processes a block of input producing a sequence of (key, value) pairs, while the Reduce function processes a set of values associated with a single key.

Type
Chapter
Information
Data-Intensive Computing
Architectures, Algorithms, and Applications
, pp. 180 - 234
Publisher: Cambridge University Press
Print publication year: 2012

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., and Rasin, A. “An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.” In Proc. of the 35th VLDB Conf., 2009.Google Scholar
2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D. J., Rasin, A., and Silberschatz, A.HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.” VLDB 2, no. 1 (2009): 922–33.Google Scholar
3. Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon. com/ec2/. Accessed May 2012.
4. Amazon SimpleDB. http://www.amazon.com/simpledb/. Accessed May 2012.
5. Oceanic remote chemical analyzer (ORCA). http://armbrustlab. ocean.washington.edu/.
6. Bancilhon, F., and Ramakrishnan, R. “An Amateur's Introduction to Recursive Query Processing Strategies.” In SIGMOD Conference, pages 16–52, 1986.Google Scholar
7. Berger, M. J., and Bokhari, S. H.A Partitioning Strategy for Nonuniform Problems on Multiprocessors.” Computers, IEEE Transactions on, C-36(5), 1987.Google Scholar
8. Boulos, J., and Ono, K.Cost Estimation of User-Defined Methods in Object-Relational Database Systems.” SIGMOD Record, 28, no. 3 (1999).CrossRefGoogle Scholar
9. Cary, A., Sun, Z., Hristidis, V., and Rishe, N. “Experiences on processing spatial data with mapreduce.” In Proceedings of the 21st International Conference on Scientific and Statistical Database Management, SSDBM 2009, pages 302–19, Berlin, Heidelberg: Springer-Verlag, 2009.CrossRefGoogle Scholar
10. Chaiken, R., Jenkins, B., Larson, P.-Å., Ramsey, B., Shakib, D., Weaver, S., and Zhou, J.SCOPE: “Easy and Efficient Parallel Processing of Massive Data Sets. In Proc. of the 34th VLDB Conf., pages 1265–76, 2008.Google Scholar
11. Chaudhuri, S., and R. Narasayya, V. “Self-Tuning Database Systems: A Decade of Progress.” In Proc. of the 33rd VLDB Conf., pages 3–14, 2007.Google Scholar
12. Cloudera. http://www.cloudera.com/.
13. Nsf cluster exploratory program. http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm. Accessed July 7, 2010.
14. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., andWelton, C.MAD skills: New Analysis Practices for Big Data.” Proc. of the VLDB Endowment, 2, no. 2 (2009): 1481–92.CrossRefGoogle Scholar
15. Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. “MapReduce Online.” In Symposium on Networked Systems Design and Implementation, pages 21–21, 2010.Google Scholar
16. CouchDB. http://couchdb.apache.org/. Accessed May 2012.
17. Dageville, B., Das, D., Dias, K., Yagoub, K., Mohamed, Z., and Mohamed, Z. “Automatic SQL Tuning in Oracle 10g.” In Proc. of the 30th VLDB Conf., pages 1098–1109, 2004.Google Scholar
18. Das, S., Sismanis, Y., Beyer, K. S.Gemulla, R.,Haas, P. J., and McPherson, J. “Ricardo: Integrating R and Hadoop.” In Proc. of the ACMSIGMOD International Conference on Management of Data, pages 987–98, 2010.Google Scholar
19. Datameer. http://www.datameer.com./. Accessed May 2012.
20. Davis, M., Efstathiou, G., Frenk, C. S., and White, S. D. M.The Evolution of Large-Scale Structure in a Universe Dominated by Cold Dark Matter.” Astroph. J. 292 (May 1985): 371–94.Google Scholar
21. Dean, J. and Ghemawat, S.. MapReduce: “Simplified Data Processing on Large Clusters.” In Proc. of the 6th OSDI Symp., 2004.Google Scholar
22. Dean, J. and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” In OSDI, pages 137–50, 2004.Google Scholar
23. Deshpande, A., Ives, Z., and Raman, V.Adaptive Query Processing. Foundations and Trends in Databases 1, no. 1 (2007): 139.CrossRefGoogle Scholar
24. Devine, K., Boman, E., and Karypis, G.Parallel Processing for Scientific Computing, chapter 6. Society for Industrial and Applied Mathematics, 2006.Google Scholar
25. Devine, K., Boman, E., Heapby, R., Hendrickson, B., and Vaughan, C.Zoltan Data Management Service for Parallel Dynamic Applications. Computing in Science and Eng. 4, no. 2 (2002): 90–96.CrossRefGoogle Scholar
26. DeWitt, D., and Jim, G.Parallel Database Systems: The Future of High Performance Database Systems. CACM, 35, no. 6 (1992): 85–98.CrossRefGoogle Scholar
27. Dewitt, D., and Stonebraker, M. MapReduce: A major step backwards. http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-stepbackwards/.
28. DeWitt, D. J., and Gray, J.Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM 35, no. 6 (1992): 85–98.CrossRefGoogle Scholar
29. DeWitt, D. J., Naughton, J. F., Schneider, D. A., and Seshadri, S. “Practical Skew Handling in Parallel Joins.” In Proc. of the 18th VLDB Conf., 1992.Google Scholar
30. Ekanayake, J., and Pallickara, S. “MapReduce for Data Intensive Scientific Analysis.” In IEEE eScience, pages 277–84, 2008.Google Scholar
31. Fisher, K.,Walker, D., and Zhu, K. Q. “Learnpads: Automatic Tool Generation from Ad Hoc Data.” In SIGMOD Conference, pages 1299–1302, 2008.Google Scholar
32. Fisher, K., Walker, D., Zhu, K. Q., and White, P.From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data.” In POPL, pages 421–34, 2008.Google Scholar
33. Gelb, J. M., and Bertschinger, E.Cold DarkMatter. 1: The Formation of Dark Halos. Astroph. J. 436 (December 1994): 467–90.CrossRefGoogle Scholar
34. Ghemawat, S., Gobioff, H., and Leung, S.-T. “The Google File System.” In Proc. of the 19th SOSP Symp., pages 29–43, 2003.Google Scholar
35. Graham, R. L.Bounds on Multiprocessing Timing Anomalies. SIAM Journal on Applied Mathematics, 17, no. 2 (1969): 416–29.CrossRefGoogle Scholar
36. Greenplum database. http://www.greenplum.com/. Accessed May 2012.
37. Hadoop. Accessed July 7, 2010. http://hadoop.apache.org/.
38. Hadoop – capacity scheduler guide. http://hadoop.apache.org/mapreduce/docs/current/capacity_scheduler.html.
39. Hdfs. Accessed July 7, 2010. http://hadoop.apache.org/common/docs/current/hdfsdesign.html.
40. Bingsheng, H., Yang, M., Guo, Z., Chen, R., Su, B., Lin, W., and Zhou, L. “Comet: Batched Stream Processing for Data Intensive Distributed Computing.” In Proc. of the 1st ACM symposium on Cloud computing, pages 63–74, 2010.Google Scholar
41. Hive. http://hadoop.apache.org/hive/.
42. Howe, B., and Maier, D. “Algebraic Manipulation of Scientific Datasets.” In VLDB '04: Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Ontario, CA, 2004.Google Scholar
43. Howe, B., Maier, D., and Bright, L. “Smoothing the Roi Curve for Scientific Data Management Applications.” In Proc. of the Third CIDR Conf., 2007.Google Scholar
44. Hua, K. A., and Lee, C. “Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning. In Proc. of the 17th VLDB Conf., 1991.Google Scholar
45. Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. “Dryad: Distributed Dataparallel Programs from Sequential Building Blocks.” In Proc. of the EuroSys Conf., pages 59–72, 2007.Google Scholar
46. Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. “Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks.” In EuroSys, pages 59–72, 2007.Google Scholar
47. Karmasphere. http://www.karmasphere.com/. Accessed May 2012.
48. Kleinberg, J. M.Authoritative Sources in a Hyperlinked Environment. J. ACM, 46 no. 5 (1999): 604–32.CrossRefGoogle Scholar
49. Knollmann, S. R., and Knebe, A.AHF: Amiga's Halo Finder.” Astroph. J. Suppl. 182 (June 2009): 608–24.CrossRefGoogle Scholar
50. Kossmann, D., Franklin, M. J., Drasch, G., and Ag, W.Cache Investment: Integrating Query Optimization and Distributed Data Placement. ACM TODS 25, no. 4 (2000): 517–58.CrossRefGoogle Scholar
51. Kwon, et al. Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing cluster. Technical Report UW-CSE-09-06-01, Dept. of Comp. Sci., Univ. of Washington, 2009.Google Scholar
52. Lin, J. “The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce.” In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009.Google Scholar
53. Mahout. http://lucene.apache.org/mahout/. Accessed July 7, 2010.
54. Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. “Pregel: A System for Large-Scale Graph Processing.” In SIGMOD Conference, pages 135–46, 2010.Google Scholar
55. Oliker, L., and Biswas, R.Plum: Parallel Load Balancing for Adaptive Unstructured Meshes. J. Parallel Distrib. Comput. 52, no. 2 (1998): 150–77.CrossRefGoogle Scholar
56. Olston, C., Bortnikov, E., Elmeleegy, K., Junqueira, F., and Reed, B. “Interactive Analysis of Web-Scale Data.” In Fourth CIDR Conf. – Perspectives, 2009.Google Scholar
57. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. “Pig Latin: A Not-So-Foreign Language for Data Processing.” In SIGMOD Conference, pages 1099–10, 2008.Google Scholar
58. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A.. “Pig Latin: A Not-So-Foreign Language for Data Processing.” In Proc. of the SIGMOD Conf., pages 1099–10, 2008.Google Scholar
59. Oracle. http://www.oracle.com/database/.
60. Page, L., Brin, S., Motwani, R., and Winograd, T. The PageR-ank Citation Ranking: Bringing Order to the Web. Technical Report 19991966, Stanford InfoLab, 1999.
61. Parashar, M., Liu, H., Li, Z., Matossian, V., Schmidt, C., Zhang, G., and Hariri, S.AutoMate: Enabling Autonomic Applications on the Grid. Cluster Computing 9, no. 2 (2006): 48–57.CrossRefGoogle Scholar
62. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. “A Comparison of Approaches to Large-Scale Data Analysis.” In SIGMOD Conference, pages 165–78, 2009.Google Scholar
63. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. “A Comparison of Approaches to Large-Scale Data Analysis.” In Proc. of the SIGMOD Conf., pages 165–78, 2009.Google Scholar
64. Qiu, X., Ekanayake, J., Beason, S., Gunarathne, T., Fox, G., Barga, R., and Gannon, D. “Cloud Technologies for Bioinformatics Applications.” In MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pages 1–10, 2009.Google Scholar
65. Shatdal, A. and Naughton, J. “Adaptive Parallel Aggregation Algorithms.” In Proc. of the SIGMOD Conf., 1995.Google Scholar
66. Springel, V., White, S. D. M., Jenkins, A., Frenk, C. S., Yoshida, N., Gao, L., Navarro, J., Thacker, R., Croton, D., Helly, J., Peacock, J. A., Cole, S., Thomas, P., Couchman, H., Evrard, A., Colberg, J., and Pearce, F.Simulations of the Formation, Evolution and Clustering of Galaxies and Quasars.” NATURE 435 (June 2005): 629–36.CrossRefGoogle ScholarPubMed
67. Sql server. http://www.microsoft.com/sqlserver/. Accessed May 2012.
68. Stadel, J. G.. Cosmological N-body Simulations and Their Analysis. PhD thesis, University of Washington, 2001.Google Scholar
69. Stonebraker, M., Abadi, D. J., DeWitt, D. J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A.Mapreduce and Parallel Dbmss: Friends or Foes?CACM, 53, no. 1 (January 2010).CrossRefGoogle Scholar
70. Stonebraker, M., Becla, J., DeWitt, D., Lim, K.-T., Maier, D., Ratzesberger, O., and Zdonik, S. “Requirements for Science Data Bases and SciDB.” In Fourth CIDR Conf. – Perspectives, 2009.Google Scholar
71. Teradata. http://www.teradata.com/.
72. Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J. S., Murthy, R., and Liu, H. “Data Warehousing and Analytics Infrastructure at Facebook.” In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 1013–20, 2010.Google Scholar
73. Walton, C. B., Dale, A. G., and Jenevein, R. M. “A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins.” In Proc. of the 17th VLDB Conf., 1991.Google Scholar
74. Snodgrass, R. T., Li, W., and Gao, D. “Skew Handling Techniques in Sort-Merge Join.” In Proc. of the SIGMOD Conf., 2002.Google Scholar
75. Weinberg, D. H.,Hernquist, L., and Katz, N.Photoionization, Numerical Resolution, and Galaxy Formation.” Astroph. J. 477 (March 1997): 8−+.CrossRefGoogle Scholar
76. Xldb workshop. http://www-conf.slac.stanford.edu/xldb/. Accessed May 2012.
77. Xu, Y., and Kostamaa, P. “Efficient Outer Join Data Skew Handling in Parallel DBMS.” In VLDB, 2009.Google Scholar
78. Xu, Y., Kostamaa, P., and Gao, L. “Integrating Hadoop and Parallel DBMs.” In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 969–74, 2010.Google Scholar
79. Xu, Y., Kostamaa, P., Zhou, X., and Chen, L. “Handling Data Skew in Parallel Joins in Shared-Nothing Systems.” In Proc. of the SIGMOD Conf., pages 1043–52, 2008.Google Scholar
80. Yu, Y., Gunda, P. K., and Isard, M. “Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations.” In Proc. of the 22nd SOSP Symp., 2009.Google Scholar
81. Zhang, W., Wang, K., and Chau, S.-C.Data Partition and Parallel Evaluation of Datalog Programs.” IEEE Trans. Knowl. Data Eng., 7, no. 1 (1995): 163–76.CrossRefGoogle Scholar
82. Zhang, Y., Herodotou, H., and Yang, J.RIOT: I/O-Efficient Numerical Computing Without SQL. In Proc. of the Fourth CIDR Conf., 2009.Google Scholar

Save book to Kindle

To save this book to your Kindle, first ensure coreplatform@cambridge.org is added to your Approved Personal Document E-mail List under your Personal Document Settings on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part of your Kindle email address below. Find out more about saving to your Kindle.

Note you can select to save to either the @free.kindle.com or @kindle.com variations. ‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi. ‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.

Find out more about the Kindle Personal Document Service.

Available formats
×

Save book to Dropbox

To save content items to your account, please confirm that you agree to abide by our usage policies. If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account. Find out more about saving content to Dropbox.

Available formats
×

Save book to Google Drive

To save content items to your account, please confirm that you agree to abide by our usage policies. If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account. Find out more about saving content to Google Drive.

Available formats
×