
1 - Scaling Up Machine Learning: Introduction

Published online by Cambridge University Press:  05 February 2012

Ron Bekkerman
LinkedIn Corporation, Mountain View, CA, USA
Mikhail Bilenko
Microsoft Research, Redmond, WA, USA
John Langford
Yahoo! Research, New York, NY, USA


Distributed and parallel processing of very large datasets has been employed for decades in specialized, high-budget settings, such as financial and petroleum industry applications. Recent years have brought dramatic progress in usability, cost effectiveness, and diversity of parallel computing platforms, with their popularity growing for a broad set of data analysis and machine learning tasks.

The current rise in interest in scaling up machine learning applications can be partially attributed to the evolution of hardware architectures and programming frameworks that make it easy to exploit the types of parallelism realizable in many learning algorithms. A number of platforms make it convenient to implement concurrent processing of data instances or their features. This allows fairly straightforward parallelization of many learning algorithms that view input as an unordered batch of examples and aggregate isolated computations over each of them.
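The pattern described above, isolated per-example computations followed by aggregation, can be illustrated with a minimal sketch. It is not taken from the chapter: the squared-loss linear model, the function names (`example_gradient`, `shard_gradient`, `parallel_gradient`), and the use of a thread pool are all assumptions chosen for illustration. A real system would more likely use processes, a cluster framework, or accelerator hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def example_gradient(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w, for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def shard_gradient(w, shard):
    """Sum the per-example gradients over one shard of the data."""
    total = [0.0] * len(w)
    for x, y in shard:
        total = add_vectors(total, example_gradient(w, x, y))
    return total


def parallel_gradient(w, examples, num_workers=4):
    """Split examples into shards, process shards concurrently, then aggregate.

    Because the batch is treated as unordered and each example is handled
    in isolation, any partition of the data yields the same sum (up to
    floating-point rounding).
    """
    shards = [examples[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda s: shard_gradient(w, s), shards))
    return reduce(add_vectors, partials)
```

Threads are used here only to keep the sketch self-contained. In CPython they would not speed up this pure-Python arithmetic, so the example shows the structure of the computation rather than a performance gain.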

Increased attention to large-scale machine learning is also due to the spread of very large datasets across many modern applications. Such datasets are often accumulated on distributed storage platforms, motivating the development of learning algorithms that can be distributed appropriately. Finally, the proliferation of sensing devices that perform real-time inference based on high-dimensional, complex feature representations drives additional demand for utilizing parallelism in learning-centric applications. Examples of this trend include speech recognition and visual object detection becoming commonplace in autonomous robots and mobile devices.

Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 1–20
Publisher: Cambridge University Press
Print publication year: 2011



