Book contents
- Frontmatter
- Contents
- Contributors
- Preface
- 1 Scaling Up Machine Learning: Introduction
- Part One Frameworks for Scaling Up Machine Learning
- Part Two Supervised and Unsupervised Learning Algorithms
- 6 PSVM: Parallel Support Vector Machines with Incomplete Cholesky Factorization
- 7 Massive SVM Parallelization Using Hardware Accelerators
- 8 Large-Scale Learning to Rank Using Boosted Decision Trees
- 9 The Transform Regression Algorithm
- 10 Parallel Belief Propagation in Factor Graphs
- 11 Distributed Gibbs Sampling for Latent Variable Models
- 12 Large-Scale Spectral Clustering with Map Reduce and MPI
- 13 Parallelizing Information-Theoretic Clustering Methods
- Part Three Alternative Learning Settings
- Part Four Applications
- Subject Index
- References
8 - Large-Scale Learning to Rank Using Boosted Decision Trees
from Part Two - Supervised and Unsupervised Learning Algorithms
Published online by Cambridge University Press: 05 February 2012
Summary
The web search ranking task has become increasingly important because of the rapid growth of the Internet. With the growth of the web and of the number of web search users, the amount of available training data for learning web ranking models has also increased. We investigate the problem of learning to rank on a cluster, using web search data composed of 140,000 queries and approximately 14 million URLs. For datasets much larger than this, distributed computing becomes essential, because of both speed and memory constraints. As a point of comparison, we use a baseline algorithm that has been carefully engineered to allow training on the full dataset on a single machine, so that we can evaluate the loss or gain incurred by the distributed algorithms we consider. The underlying algorithm is a boosted tree ranking algorithm called LambdaMART, in which a split at a given vertex of each decision tree is determined by the split criterion for a particular feature. Our contributions are twofold. First, we implement a method that speeds up training when the training data fits in the main memory of a single machine, by distributing the vertex split computations of the decision trees; the resulting model is equivalent to the model produced by centralized training, but training is faster. Second, we develop a training method for the case where the training data exceeds the main memory of a single machine. This second approach, based on data distribution, scales easily to far larger datasets, that is, to billions of examples.
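To make the first contribution concrete, below is a minimal Python sketch (not the chapter's implementation) of feature-distributed vertex splitting: each worker scans its assigned features for the best split of the node's gradient values, and a coordinator reduces the per-feature candidates to the single best split. The names `Split`, `best_split_for_feature`, and `best_split` are illustrative assumptions, and the split criterion shown is the standard squared-error reduction of a regression tree.

```python
# Sketch of feature-distributed vertex splitting (illustrative only): each
# worker finds the best split over a subset of features; the coordinator
# reduces the per-feature candidates to the single best split for the node.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
import numpy as np

@dataclass
class Split:                       # hypothetical container, not from the chapter
    feature: int
    threshold: float
    gain: float                    # reduction in squared error of the gradients

def best_split_for_feature(X, g, j):
    """Best threshold on feature j for regressing the gradient values g,
    scored by the usual squared-error reduction of a regression tree."""
    order = np.argsort(X[:, j])
    x, gs = X[order, j], g[order]
    prefix = np.cumsum(gs)
    n, total = len(gs), prefix[-1]
    best = Split(j, float("nan"), 0.0)
    for i in range(1, n):          # candidate split between positions i-1 and i
        if x[i] == x[i - 1]:
            continue               # cannot split between identical feature values
        left, right = prefix[i - 1], total - prefix[i - 1]
        gain = left**2 / i + right**2 / (n - i) - total**2 / n
        if gain > best.gain:
            best = Split(j, 0.5 * (x[i - 1] + x[i]), gain)
    return best

def best_split(X, g, n_workers=4):
    """Distribute the per-feature scans over worker processes and reduce."""
    d = X.shape[1]
    # Passing X to every task pickles it repeatedly; a real implementation
    # would shard the feature columns across long-lived workers instead.
    with ProcessPoolExecutor(n_workers) as pool:
        candidates = pool.map(best_split_for_feature, [X] * d, [g] * d, range(d))
        return max(candidates, key=lambda s: s.gain)

if __name__ == "__main__":         # tiny smoke test with random data
    rng = np.random.default_rng(0)
    X, g = rng.normal(size=(1000, 8)), rng.normal(size=1000)
    print(best_split(X, g))
```

Because each feature's scan is independent, the reduce step changes nothing about the chosen split, which is consistent with the summary's claim that this scheme yields a model equivalent to centralized training, only faster.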
Chapter from Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 148-169. Publisher: Cambridge University Press. Print publication year: 2011.