Book contents
- Frontmatter
- Contents
- Introduction
- Part 1 Modeling Web Data
- Part 2 Web Data Semantics and Integration
- Part 3 Building Web Scale Applications
- 13 Web Search
- 14 An Introduction to Distributed Systems
- 15 Distributed Access Structures
- 16 Distributed Computing with MapReduce and Pig
- 17 Putting into Practice: Full-Text Indexing with Lucene
- 18 Putting into Practice: Recommendation Methodologies
- 19 Putting into Practice: Large-Scale Data Management with Hadoop
- 20 Putting into Practice: CouchDB, a JSON Semistructured Database
- Bibliography
- Index
14 - An Introduction to Distributed Systems
from Part 3 - Building Web Scale Applications
Published online by Cambridge University Press: 05 June 2012
Summary
This chapter is an introduction to very large data management in distributed systems. Here, "very large" means a context where gigabytes (1,000 MB = 10^9 bytes) constitute the unit size for measuring data volumes. Terabytes (10^12 bytes) are commonly encountered, and many Web companies and scientific or financial institutions must deal with petabytes (10^15 bytes). In the near future, we can expect exabyte (10^18 bytes) data sets, with the worldwide digital universe roughly estimated (in 2010) at about 1 zettabyte (10^21 bytes).
Distribution is key to handling very large data sets. Distribution is necessary (but not sufficient) to achieve scalability (i.e., the ability to maintain stable performance for steadily growing data collections by adding new resources to the system). However, distribution brings a number of technical problems that make the design and implementation of distributed storage, indexing, and computing a delicate issue. A prominent concern is the risk of failure. In an environment consisting of hundreds or thousands of computers (a common setting for large Web companies), failures of components (hardware, network, local systems, disks) are frequent, and the system must be ready to cope with them at any moment.
Our presentation covers principles and techniques that recently emerged to handle Web-scale data sets. We examine the extension of traditional storage and indexing methods to large-scale distributed settings. We describe techniques to efficiently process point queries that aim at retrieving a particular object.
- Type: Chapter
- Information: Web Data Management, pp. 287–309
- Publisher: Cambridge University Press
- Print publication year: 2011