Based on the authors' extensive teaching experience, this hands-on graduate-level textbook teaches how to carry out large-scale data analytics and design machine learning solutions for big data. With a focus on fundamentals, this extensively class-tested textbook walks students through the key principles and paradigms for working with large-scale data, the frameworks for large-scale data analytics (Hadoop, Spark), and how to implement machine learning to exploit big data. It is unique in covering the principles that aspiring data scientists need to know, without detail that can overwhelm. Real-world examples, hands-on coding exercises, and labs combine with exceptionally clear explanations to maximize student engagement. Well-defined learning objectives, exercises with online solutions for instructors, lecture slides, and an accompanying suite of lab exercises of increasing difficulty in Jupyter Notebooks offer a coherent and convenient teaching package. It is an ideal teaching resource for courses on large-scale data analytics with machine learning in computer science and data science departments.
This chapter introduces what we mean by big data, why it matters, and the key principles for handling it efficiently. The term “big data” is frequently used to refer to the idea of exploiting lots of data to obtain some benefit, but there is no standard definition; the field is also commonly known as large-scale data processing or data-intensive applications. We discuss the key components of big data and how it is not all about volume: other aspects, such as velocity and variety, also need to be considered. The world of big data has multiple facets, such as databases, infrastructure, and security, but we focus on data analytics. We then cover how to deal with big data, explaining why we cannot simply scale up a single computer and must instead scale out, using multiple machines to process the data. We also explain why traditional high-performance computing clusters are not appropriate for data-intensive applications and how they would overwhelm the network. Finally, we introduce the key features of a big data cluster: it is not focused on pure computation, it does not require high-end computers, it needs fault tolerance mechanisms, and it respects the principle of data locality.
This chapter introduces the machine learning side of this book. Although we assume some prior experience in machine learning, we start with a full recap of the basic concepts and key terminology. This includes a discussion of learning paradigms, such as supervised and unsupervised learning, and of the machine learning life cycle, articulating the steps that take us from data collection to model deployment. We cover topics such as data preparation and preprocessing, model evaluation and selection, and machine learning pipelines, showing how every stage of this cycle is affected when we move to large-scale data analytics. The rest of the chapter is devoted to Spark's machine learning library, MLlib. Basic concepts such as Transformers, Estimators, and Pipelines are presented through an example using linear regression. The example requires a pipeline of methods to get the data ready for training, which allows us to introduce some of Spark's data preparation packages (e.g., VectorAssembler or StandardScaler). Finally, we explore the evaluation packages (e.g., RegressionEvaluator) and how to perform hyperparameter tuning.
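As a brief illustration of the kind of MLlib workflow the chapter builds, the following PySpark sketch chains a VectorAssembler, a StandardScaler, and a LinearRegression into a Pipeline and evaluates it with a RegressionEvaluator. The toy data and column names (x1, x2, label) are invented for the example; they are not taken from the book.

```python
# Minimal MLlib pipeline sketch: assemble features -> scale -> fit -> evaluate.
# The tiny in-memory dataset and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("mllib-pipeline-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 3.5), (2.0, 0.5, 4.1), (3.0, 1.5, 6.2)],
    ["x1", "x2", "label"],
)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])   # Estimators + Transformers

model = pipeline.fit(df)            # fitting the Pipeline yields a PipelineModel
predictions = model.transform(df)   # the model is a Transformer adding "prediction"
rmse = RegressionEvaluator(labelCol="label", metricName="rmse").evaluate(predictions)
print(f"RMSE: {rmse:.3f}")
```

The same pattern extends naturally to hyperparameter tuning, where the fitted pipeline is wrapped in a cross-validation or train/validation split procedure.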
MapReduce is a parallel programming model that follows a simple divide-and-conquer strategy to tackle big datasets in distributed computing. This chapter begins with a discussion of the key distinguishing features of MapReduce and its differences with respect to similar distributed computing tools such as the Message Passing Interface (MPI). Then, we introduce its two main functions, map and reduce, which are rooted in functional programming. After that, we present how MapReduce works using the classical WordCount example, the “Hello World” of big data, discussing different ways to parallelize it and their main advantages and disadvantages. Next, we describe MapReduce more formally, expressing its functions in terms of key–value pairs and establishing the key properties of the map, shuffle, and reduce operations. At the end of the chapter we cover some important details: how fault tolerance is achieved, how MapReduce can be exploited to preserve data locality, how it can reduce data transfer across computers using combiners, and additional aspects of its inner workings.
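To make the map–shuffle–reduce flow concrete, here is a toy, single-machine Python sketch of WordCount that mimics the key–value contract of MapReduce. It is not a distributed implementation; the function names and the hard-coded input lines are ours.

```python
# Toy illustration of the map -> shuffle -> reduce phases for WordCount.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts associated with each word.
    return key, sum(values)

lines = ["big data is big", "data is data"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster, the map and reduce calls run on different machines over different data splits, and the shuffle is performed by the framework over the network.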
The goal of this chapter is to present complete examples of the design and implementation of machine learning methods in large-scale data analytics. In particular, we choose three distinct topics: semi-supervised learning, ensemble learning, and deploying deep learning models at scale. Each topic is introduced by motivating why parallelization is needed to deal with big data, identifying the main bottlenecks, designing and coding Spark-based solutions, and discussing the further work required to improve the code. In semi-supervised learning, we focus on the simplest self-labeling approach, self-training, and a global solution for it. Likewise, in ensemble learning, we design a global approach for bagging and boosting. Lastly, we show an example with deep learning: rather than parallelizing the training of a model, which is typically easier on GPUs, we deploy the inference step for a case study in semantic image segmentation.
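As a hedged sketch of the first of these topics, the loop below outlines the self-training idea in PySpark: train on the labeled data, pseudo-label the most confident unlabeled rows, and retrain. The base learner (MLlib's LogisticRegression), the column names, and the confidence threshold are illustrative assumptions, not the book's exact global implementation.

```python
# Self-training sketch: repeatedly pseudo-label confident rows and retrain.
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.functions import vector_to_array

def self_training(labeled_df, unlabeled_df, threshold=0.9, max_iter=5):
    """Assumes columns 'features' (Vector) and 'label' (double) in labeled_df."""
    model = None
    for _ in range(max_iter):
        model = LogisticRegression(featuresCol="features",
                                   labelCol="label").fit(labeled_df)
        preds = model.transform(unlabeled_df)
        # Confidence = maximum class probability predicted for the row.
        preds = preds.withColumn(
            "confidence", F.array_max(vector_to_array(F.col("probability"))))
        confident = preds.filter(F.col("confidence") >= threshold)
        if confident.count() == 0:
            break
        # Add confidently predicted rows to the labeled set...
        labeled_df = labeled_df.select("features", "label").unionByName(
            confident.select("features", F.col("prediction").alias("label")))
        # ...and keep only the still-uncertain rows as unlabeled data.
        unlabeled_df = preds.filter(F.col("confidence") < threshold).select("features")
    return model
```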
Hadoop is an open-source framework, written in Java, for big data processing and storage that is based on the MapReduce programming model. This chapter starts with a brief introduction to Hadoop and how it has evolved into a solid base platform for most big data frameworks. We show how to implement the classical WordCount using Hadoop MapReduce, highlighting the difficulties in doing so. After that, we provide essential information about how Hadoop's resource negotiator, YARN, and its distributed file system, HDFS, work. We describe step by step how a MapReduce job is executed on YARN, introducing the concepts of resource and node managers, the application master, and containers, as well as the different execution modes (standalone, pseudo-distributed, and fully distributed). Likewise, we talk about HDFS, covering the basic design of this file system and what it means in terms of functionality and efficiency. We also discuss recent advances such as erasure coding, HDFS federation, and high availability. Finally, we examine the main limitations of Hadoop and how they have sparked the rise of many new big data frameworks, which now coexist within the Hadoop ecosystem.
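The book's Hadoop WordCount is written against the native Java API; as a Python-friendly stand-in (our assumption, not the book's code), the same job can be expressed for Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing tab-separated key–value pairs.

```python
# Hadoop Streaming stand-in for WordCount; in practice these are two separate
# scripts (mapper.py and reducer.py) passed to the hadoop-streaming jar.
import sys

def mapper(stdin=sys.stdin):
    # Emit one "word<TAB>1" line per word read from the input split.
    for line in stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stdin=sys.stdin):
    # Streaming delivers mapper output sorted by key, so counts for the same
    # word arrive contiguously and can be summed with a running total.
    current_word, current_count = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```

Even in this simplified form, the boilerplate around a single key–value computation hints at the difficulties the chapter highlights.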
This chapter puts forward new guidelines for designing and implementing distributed machine learning algorithms for big data. First, we present two alternatives, which we call the local and global approaches. To show how these two strategies work, we focus on the classical decision tree algorithm, reviewing how it works and the details that need modification to deal with large datasets. We implement a local-based solution for decision trees, comparing its behavior and efficiency against a sequential model and the MLlib version. We also discuss the nitty-gritty of the MLlib implementation of decision trees as a great example of a global solution. This allows us to formally define the two concepts and discuss their key (expected) advantages and disadvantages. The second part is all about measuring the scalability of a big data solution. We present three classical metrics, speed-up, size-up, and scale-up, which help determine whether a distributed solution is scalable. Using these, we test our local-based approach and compare it against its global counterpart. This experiment allows us to give some tips for computing these metrics correctly on a Spark cluster.
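As a small illustration of how such measurements are read, the helpers below compute the three ratios from wall-clock timings under their commonly used definitions (an assumption on our part; the book gives the precise formulations). The timing numbers are invented purely to show the arithmetic.

```python
# Illustrative scalability ratios computed from wall-clock timings (in seconds).
def speed_up(t_one_node, t_n_nodes):
    # Same dataset on 1 node vs. n nodes; the ideal value is n.
    return t_one_node / t_n_nodes

def size_up(t_base_data, t_k_times_data):
    # Same cluster, dataset grown k times; ideal growth is linear (ratio <= k).
    return t_k_times_data / t_base_data

def scale_up(t_one_node_base, t_n_nodes_k_data):
    # Nodes and data grown together; an ideal system stays close to 1.
    return t_one_node_base / t_n_nodes_k_data

print(speed_up(t_one_node=1200, t_n_nodes=180))              # ~6.7 on, say, 8 nodes
print(size_up(t_base_data=150, t_k_times_data=580))          # ~3.9 for 4x the data
print(scale_up(t_one_node_base=150, t_n_nodes_k_data=170))   # ~0.88
```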
This chapter introduces Spark, a data processing engine that mitigates some of the limitations of Hadoop MapReduce to perform data analytics efficiently. We begin with the motivation behind Spark, introducing Spark RDDs as an in-memory distributed data structure that allows for faster processing while maintaining the attractive properties of Hadoop, such as fault tolerance. We then cover, hands-on, how to create and operate on RDDs, distinguishing between transformations and actions. Furthermore, we discuss how to work with key–value RDDs (which is closer to the MapReduce model), how to use caching for iterative queries and operations, and how RDD lineage works to ensure fault tolerance. We provide a wide range of examples with transformations such as map vs. flatMap and groupByKey vs. reduceByKey, discussing their behavior, their adequacy depending on what we want to achieve, and their performance. More advanced concepts, such as shared variables (broadcast variables and accumulators) and working with partitions, are presented towards the end. Finally, we talk about the anatomy of a Spark application, as well as the different types of dependencies (narrow vs. wide) and the limitations they impose on optimizing processing.
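The following compact PySpark snippet walks through several of the operations mentioned above; the hard-coded input lines are ours, so the example is self-contained.

```python
# RDD walkthrough: map vs. flatMap, key-value pairs, reduceByKey, and caching.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data is big", "data is data"])

# map keeps one output element per input element; flatMap flattens the lists.
print(lines.map(lambda l: l.split()).collect())   # [['big', 'data', ...], ...]
words = lines.flatMap(lambda l: l.split())
print(words.collect())                            # ['big', 'data', 'is', ...]

# reduceByKey combines values per key within each partition before shuffling,
# which usually makes it cheaper than groupByKey followed by a sum.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.cache()                                    # keep in memory for reuse
print(counts.collect())                           # ('big', 2), ('data', 3), ('is', 2) in some order
print(counts.count())                             # second action reuses the cached RDD
```

Note that collect() and count() are actions that trigger execution, whereas map, flatMap, and reduceByKey are lazy transformations recorded in the RDD lineage.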
In this chapter we cover clustering and regression, looking at two traditional machine learning methods: k-means and linear regression. We briefly discuss how to implement these methods in a non-distributed manner first, and then carefully analyze their bottlenecks when manipulating big data. This enables us to design global solutions based on the DataFrame API of Spark. The key focus is on the principles for designing solutions effectively; nevertheless, some of the challenges in this chapter invite the reader to investigate Spark tools that speed up the processing even further. k-means serves as an example of an iterative algorithm and of how to exploit caching in Spark, and we analyze its implementation with both the RDD and DataFrame APIs. For linear regression, we first implement the closed-form solution, which involves numerous matrix multiplications and outer products, to simplify the processing of big data. Then, we look at gradient descent. These examples give us the opportunity to expand on the principles of designing a global solution, and also to show how knowing the underlying platform well, Spark in this case, is essential to truly maximize performance.
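As a hedged sketch of the closed-form idea, note that the normal equations only need X^T X and X^T y, each of which is a sum of small per-row contributions (an outer product and a scaled feature vector), so a cluster can aggregate partial sums and the driver only solves a d x d system. The toy rows below are invented and use an RDD for brevity rather than the chapter's DataFrame-based solution.

```python
# Distributed normal equations sketch: sum per-row outer products, solve on driver.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ols-sketch").getOrCreate()
sc = spark.sparkContext

# Toy (features, label) rows; the leading 1.0 plays the role of an intercept term.
rows = sc.parallelize([
    (np.array([1.0, 1.0]), 2.9),
    (np.array([1.0, 2.0]), 5.1),
    (np.array([1.0, 3.0]), 7.2),
])

# Each row contributes (x x^T, x * y); the aggregated matrices are tiny (d x d).
xtx, xty = rows.map(lambda r: (np.outer(r[0], r[0]), r[0] * r[1])) \
               .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))

w = np.linalg.solve(xtx, xty)   # solves (X^T X) w = X^T y on the driver
print(w)                        # roughly [0.77, 2.15] for this toy data
```

Gradient descent follows a similar pattern, except that the per-row contribution is a gradient term and the aggregation is repeated once per iteration, which is where caching pays off.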
Spark SQL is a module in Spark for structured data processing that improves upon RDDs. The chapter explains how imposing structure on the data helps Spark perform further optimizations. We talk about the transition from RDDs to DataFrames and Datasets, including a brief description of the Catalyst and Tungsten projects. Since Datasets are not available in Python, we focus on DataFrames. With a learn-by-example approach, we see how to create DataFrames, inferring the schema automatically or defining it manually, and how to operate on them. We show how these operations usually feel more natural to SQL developers, although we can interact with this API either in an object-oriented programming style or through SQL. As we did with RDDs, we showcase various examples to demonstrate how different DataFrame operations work. Starting from standard transformations such as select or filter, we move to more specific operations such as Column transformations and efficient aggregations using Spark functions. As advanced content, we include the implementation of user-defined functions for DataFrames, as well as an introduction to pandas-on-Spark, a powerful API for programmers more accustomed to pandas.
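A short DataFrame example of the kinds of operations listed above: schema inference, select/filter, a derived Column, an aggregation with Spark functions, and the same query expressed in SQL. The data and column names are made up for illustration.

```python
# DataFrame sketch: inference, select/filter, Column expressions, aggregation, SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", "ES", 34), ("bob", "UK", 29), ("carol", "ES", 41)],
    ["name", "country", "age"],
)
df.printSchema()   # schema inferred automatically from the Python data

# Object-oriented style: filter, a derived Column, select, and an aggregation.
adults = df.filter(F.col("age") >= 30).withColumn("age_in_months", F.col("age") * 12)
adults.select("name", "age_in_months").show()
df.groupBy("country").agg(F.avg("age").alias("avg_age")).show()

# The same aggregation expressed as SQL over a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT country, AVG(age) AS avg_age FROM people GROUP BY country").show()
```

Both styles compile to the same logical plan, which is what allows Catalyst to optimize them identically.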