Book contents
- Frontmatter
- Contents
- Contributors
- Preface
- 1 Scaling Up Machine Learning: Introduction
- Part One Frameworks for Scaling Up Machine Learning
- 2 MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles
- 3 Large-Scale Machine Learning Using DryadLINQ
- 4 IBM Parallel Machine Learning Toolbox
- 5 Uniformly Fine-Grained Data-Parallel Computing for Machine Learning Algorithms
- Part Two Supervised and Unsupervised Learning Algorithms
- Part Three Alternative Learning Settings
- Part Four Applications
- Subject Index
- References
2 - MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles
from Part One - Frameworks for Scaling Up Machine Learning
Published online by Cambridge University Press: 05 February 2012
Summary
In this chapter we look at leveraging the MapReduce distributed computing framework (Dean and Ghemawat, 2004) for parallelizing machine learning methods of wide interest, with a specific focus on learning ensembles of classification or regression trees. Building a production-ready implementation of a distributed learning algorithm can be a complex task. With the wide and growing availability of MapReduce-capable computing infrastructures, it is natural to ask whether such infrastructures may be of use in parallelizing common data mining tasks such as tree learning. For many data mining applications, MapReduce may offer scalability as well as ease of deployment in a production setting (for reasons explained later).
We initially give an overview of MapReduce and outline its application in a classic clustering algorithm, k-means. Subsequently, we focus on PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations and implements each one using the MapReduce model. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
MapReduce is a simple model for distributed computing that abstracts away many of the difficulties in parallelizing data management operations across a cluster of commodity machines.
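To make the model concrete, the sketch below (illustrative only, not the chapter's PLANET code) expresses one k-means iteration in MapReduce terms, simulated in-process: the map phase assigns each point to its nearest centroid, the shuffle phase groups points by cluster id, and the reduce phase averages each group into a new centroid. All function names are hypothetical.

```python
# Illustrative sketch: one k-means iteration as a MapReduce job,
# simulated in a single process. Not the chapter's implementation.
from collections import defaultdict

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    nearest = dists.index(min(dists))
    return nearest, (point, 1)

def kmeans_reduce(values):
    """Reduce: average the points assigned to one cluster."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for i, x in enumerate(point):
            total[i] += x
        count += n
    return [t / count for t in total]

def kmeans_iteration(points, centroids):
    # Shuffle phase: group mapper output by key, as the framework would.
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    new_centroids = list(centroids)  # clusters with no points keep their centroid
    for key, values in groups.items():
        new_centroids[key] = kmeans_reduce(values)
    return new_centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
# centroids -> [[0.0, 0.5], [10.0, 10.5]]
```

In a real MapReduce deployment the map calls run in parallel across workers, the framework performs the shuffle, and the driver re-runs the job until the centroids converge; the per-record map and per-key reduce structure is what makes the algorithm fit the model.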
- Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 23-48. Publisher: Cambridge University Press. Print publication year: 2011