
2 - MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles

from Part One - Frameworks for Scaling Up Machine Learning

Published online by Cambridge University Press:  05 February 2012

Biswanath Panda, Google Inc., Mountain View, CA, USA
Joshua S. Herbach, Google Inc., Mountain View, CA, USA
Sugato Basu, Google Research, Mountain View, CA, USA
Roberto J. Bayardo, Google Research, Mountain View, CA, USA
Ron Bekkerman, LinkedIn Corporation, Mountain View, California
Mikhail Bilenko, Microsoft Research, Redmond, Washington
John Langford, Yahoo! Research, New York

Summary

In this chapter we look at leveraging the MapReduce distributed computing framework (Dean and Ghemawat, 2004) for parallelizing machine learning methods of wide interest, with a specific focus on learning ensembles of classification or regression trees. Building a production-ready implementation of a distributed learning algorithm can be a complex task. With the wide and growing availability of MapReduce-capable computing infrastructures, it is natural to ask whether such infrastructures may be of use in parallelizing common data mining tasks such as tree learning. For many data mining applications, MapReduce may offer scalability as well as ease of deployment in a production setting (for reasons explained later).

We initially give an overview of MapReduce and outline its application in a classic clustering algorithm, k-means. Subsequently, we focus on PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations and implements each one using the MapReduce model. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
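To make the MapReduce abstraction concrete before the full discussion, the sketch below simulates a single k-means iteration in map/reduce style on one machine: the map step assigns each point to its nearest current center and emits partial sums keyed by that center, and the reduce step averages each group into an updated center. The function names, the in-memory shuffle, and the toy data are illustrative assumptions only, not code from the chapter.

```python
# Minimal single-machine simulation of one k-means iteration in MapReduce style.
# Hypothetical sketch: names and the in-memory shuffle are illustrative only.
from collections import defaultdict

def kmeans_map(point, centers):
    """Map phase: emit (nearest_center_index, (point, 1)) for one input point."""
    best = min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
    yield best, (point, 1)

def kmeans_reduce(center_index, values):
    """Reduce phase: average all points assigned to one center."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        count += n
        for d in range(dim):
            total[d] += point[d]
    return center_index, [t / count for t in total]

def kmeans_iteration(points, centers):
    """Driver: group map outputs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for point in points:
        for key, value in kmeans_map(point, centers):
            groups[key].append(value)
    new_centers = list(centers)
    for key, values in groups.items():
        _, new_centers[key] = kmeans_reduce(key, values)
    return new_centers

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (4.9, 5.2)]
    print(kmeans_iteration(data, [(0.0, 0.0), (5.0, 5.0)]))
```

Running the map step in parallel over data shards and the reduce step independently per key is exactly the division of labor the framework automates.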

MapReduce is a simple model for distributed computing that abstracts away many of the difficulties in parallelizing data management operations across a cluster of commodity machines.
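The same pattern extends to tree learning in the spirit of PLANET: mappers stream over training records and emit, for each candidate split of the node being expanded, sufficient statistics for the records that would fall to the left branch, while reducers aggregate those statistics and score each split (for regression, by variance reduction) against the node totals. The single-machine sketch below is a rough, hypothetical illustration of that idea only; the names, toy data, and in-memory shuffle are invented, and the chapter presents the actual framework.

```python
# Rough single-machine illustration of distributing regression-tree split search,
# in the spirit of the approach described in this chapter. All names and the
# in-memory "shuffle" are hypothetical; the real framework runs as MapReduce jobs.
from collections import defaultdict

def split_map(record, thresholds):
    """Map phase: for each candidate (feature, threshold), emit left-branch
    sufficient statistics (count, sum_y, sum_y^2) for one training record."""
    x, y = record
    for feature, cuts in thresholds.items():
        for t in cuts:
            if x[feature] < t:
                yield (feature, t), (1, y, y * y)

def split_reduce(key, stats, node_total):
    """Reduce phase: aggregate left-branch statistics and score the split by
    variance reduction (parent impurity minus summed child impurities)."""
    n, s, s2 = node_total
    nl = sl = sl2 = 0.0
    for c, sy, sy2 in stats:
        nl += c
        sl += sy
        sl2 += sy2
    nr, sr, sr2 = n - nl, s - sl, s2 - sl2
    if nl == 0 or nr == 0:
        return key, float("-inf")
    def sse(cnt, sm, sq):  # sum of squared errors around the group mean
        return sq - sm * sm / cnt
    return key, sse(n, s, s2) - (sse(nl, sl, sl2) + sse(nr, sr, sr2))

def best_split(records, thresholds):
    """Driver: group map outputs by key, reduce each group, keep the best score."""
    n = len(records)
    s = sum(y for _, y in records)
    s2 = sum(y * y for _, y in records)
    groups = defaultdict(list)
    for rec in records:
        for key, value in split_map(rec, thresholds):
            groups[key].append(value)
    return max((split_reduce(k, v, (n, s, s2)) for k, v in groups.items()),
               key=lambda kv: kv[1])

if __name__ == "__main__":
    data = [({"age": a}, float(a > 30)) for a in (18, 22, 35, 41, 29, 52)]
    print(best_split(data, {"age": [25, 30, 40]}))
```

Because only left-branch statistics are emitted, the right branch can be recovered from the node totals, which keeps map output small.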

Type: Chapter
Information: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 23–48
Publisher: Cambridge University Press
Print publication year: 2011


References

Alsabti, K., Ranka, S., and Singh, V. 1998. CLOUDS: A Decision Tree Classifier for Large Datasets. Technical Report, University of Florida.
Ben-Haim, Y., and Yom-Tov, E. 2008. A Streaming Parallel Decision Tree Algorithm. In: Large Scale Learning Challenge Workshop at the International Conference on Machine Learning (ICML).
Bradford, J. P., Fortes, J. A. B., and Bradford, J. 1999. Characterization and Parallelization of Decision Tree Induction. Technical Report, Purdue University.
Breiman, L. 1996. Bagging Predictors. Machine Learning Journal, 24(2), 123–140.
Breiman, L. 2001. Random Forests. Machine Learning Journal, 45(1), 5–32.
Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks.
Caragea, D., Silvescu, A., and Honavar, V. 2004. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems, 1(1–2), 80–89.
Caruana, R., and Niculescu-Mizil, A. 2006. An Empirical Comparison of Supervised Learning Algorithms. Pages 161–168 of: International Conference on Machine Learning (ICML).
Caruana, R., Karampatziakis, N., and Yessenalina, A. 2008. An Empirical Evaluation of Supervised Learning in High Dimensions. Pages 96–103 of: International Conference on Machine Learning (ICML).
Chan, P. K., and Stolfo, S. J. 1993. Toward Parallel and Distributed Learning by Meta-learning. Pages 227–240 of: Workshop on Knowledge Discovery in Databases at the Conference of the Association for the Advancement of Artificial Intelligence (AAAI).
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. 2007. Map-Reduce for Machine Learning on Multicore. Pages 281–288 of: Advances in Neural Information Processing Systems (NIPS) 19.
Dean, J., and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. In: Symposium on Operating System Design and Implementation (OSDI).
Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification, 2nd ed. New York: Wiley.
Freund, Y., and Schapire, R. E. 1996. Experiments with a New Boosting Algorithm. Pages 148–156 of: International Conference on Machine Learning (ICML).
Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29(5), 1189–1232.
Gao, J., Wu, Q., Burges, C., Svore, K., Su, Y., Khan, N., Shah, S., and Zhou, H. 2009 (August). Model Adaptation via Model Interpolation and Boosting for Web Search Ranking. Pages 505–513 of: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.
Gehrke, J., Ramakrishnan, R., and Ganti, V. 1998. RainForest – A Framework for Fast Decision Tree Construction of Large Datasets. Pages 416–427 of: International Conference on Very Large Data Bases (VLDB).
Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.-Y. 1999. BOAT – Optimistic Decision Tree Construction. Pages 169–180 of: International Conference on ACM Special Interest Group on Management of Data (SIGMOD).
Giannella, C., Liu, K., Olsen, T., and Kargupta, H. 2004. Communication Efficient Construction of Decision Trees over Heterogeneously Distributed Data. Pages 67–74 of: International Conference on Data Mining (ICDM).
Jin, R., and Agrawal, G. 2003a. Efficient Decision Tree Construction on Streaming Data. Pages 571–576 of: SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Jin, R., and Agrawal, G. 2003b. Communication and Memory Efficient Parallel Decision Tree Construction. Pages 119–129 of: SIAM Conference on Data Mining (SDM).
Joshi, M. V., Karypis, G., and Kumar, V. 1998. ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets. Pages 573–579 of: International Parallel Processing Symposium (IPPS).
Kaushik, A. 2007a (August). Bounce Rate as Sexiest Web Metric Ever. MarketingProfs. http://www.marketingprofs.com/7/bounce-rate-sexiest-web-metric-ever-kaushik.asp?sp=1.
Kaushik, A. 2007b (May). Excellent Analytics Tip 11: Measure Effectiveness of Your Web Pages. Occam's Razor (blog). www.kaushik.net/avinash/2007/05/excellent-analytics-tip-11-measure-effectiveness-of-your-web-pages.html.
Lazarevic, A. 2001. The Distributed Boosting Algorithm. Pages 311–316 of: SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
MacQueen, J. B. 1967. Some Methods for Classification and Analysis of Multivariate Observations. Pages 281–297 of: Le Cam, L. M., and Neyman, J. (eds), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley: University of California Press.
Manku, G. S., Rajagopalan, S., and Lindsay, B. G. 1999. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Pages 251–262 of: International Conference on ACM Special Interest Group on Management of Data (SIGMOD).
Mehta, M., Agrawal, R., and Rissanen, J. 1996. SLIQ: A Fast Scalable Classifier for Data Mining. Pages 18–32 of: International Conference on Extending Data Base Technology (EDBT).
Provost, F., and Fayyad, U. 1999. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 3, 131–169.
Ridgeway, G. 2006. Generalized Boosted Models: A Guide to the GBM Package. http://cran.r-project.org/web/packages/gbm.
Rokach, L., and Maimon, O. 2008. Data Mining with Decision Trees: Theory and Applications. World Scientific.
Sculley, D., Malkin, R., Basu, S., and Bayardo, R. J. 2009. Predicting Bounce Rates in Sponsored Search Advertisements. Pages 1325–1334 of: SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Shafer, J. C., Agrawal, R., and Mehta, M. 1996. SPRINT: A Scalable Parallel Classifier for Data Mining. Pages 544–555 of: International Conference on Very Large Data Bases (VLDB).
Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Berlin: Springer.
Zaki, M. J., Ho, C.-T., and Agrawal, R. 1999. Parallel Classification for Data Mining on Shared-Memory Multiprocessors. Pages 198–205 of: International Conference on Data Engineering (ICDE).
