Book contents
- Frontmatter
- Contents
- Introduction
- Part 1 Modeling Web Data
- Part 2 Web Data Semantics and Integration
- Part 3 Building Web Scale Applications
- 13 Web Search
- 14 An Introduction to Distributed Systems
- 15 Distributed Access Structures
- 16 Distributed Computing with MapReduce and Pig
- 17 Putting into Practice: Full-Text Indexing with Lucene
- 18 Putting into Practice: Recommendation Methodologies
- 19 Putting into Practice: Large-Scale Data Management with Hadoop
- 20 Putting into Practice: CouchDB, a JSON Semistructured Database
- Bibliography
- Index
13 - Web Search
from Part 3 - Building Web Scale Applications
Published online by Cambridge University Press: 05 June 2012
Summary
The World Wide Web holds tens of billions of freely accessible documents, and that number keeps growing. One of the major issues it raises is how to search these documents effectively and efficiently to find those that best suit a user's need. The purpose of this chapter is to describe the techniques at the core of today's search engines (such as Google, Bing, or Exalead), that is, mostly keyword search in very large collections of text documents. We also briefly touch upon other techniques and research issues that may be of importance in next-generation search engines.
This chapter is organized as follows. In Section 13.1, we briefly recall the Web and the languages and protocols it relies upon. Most of these topics have already been covered earlier in the book, and their introduction here is mostly intended to make the present chapter self-contained. We then present in Section 13.2 the techniques that can be used to retrieve pages from the Web, that is, to crawl it, and to extract text tokens from them. First-generation search engines, exemplified by AltaVista, mostly relied on the classical information retrieval (IR) techniques, applied to text documents, that are described in Section 13.3. The advent of the Web, and more generally the steady growth of document collections managed by institutions of all kinds, has led to extensions of these techniques. We address scalability issues in Section 13.3.3, with a focus on centralized indexing. Distributed approaches are investigated in Chapter 14.
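As a rough illustration of what keyword search over extracted text tokens can look like, here is a minimal sketch of an inverted index in Python. It is not taken from the chapter: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the choice of conjunctive (all keywords must match) semantics are assumptions made for illustration only, and the sketch ignores ranking, compression, and the scalability concerns the chapter discusses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every keyword of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Start from the posting set of the first keyword, then intersect with the others.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical toy collection, standing in for crawled pages.
docs = {
    1: "Web search engines crawl the Web",
    2: "Inverted index for keyword search",
    3: "Crawling and indexing Web pages",
}
index = build_index(docs)
print(search(index, "web search"))
```

The index is built once and each query only touches the posting sets of its own keywords, which is the basic reason this structure is preferred to scanning every document at query time.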
- Type
- Chapter
- Information
- Web Data Management, pp. 247-286
- Publisher: Cambridge University Press
- Print publication year: 2011