5 - Link Analysis

Published online by Cambridge University Press:  05 December 2014

Jure Leskovec
Affiliation:
Stanford University, California
Anand Rajaraman
Affiliation:
Milliways Laboratories, California
Jeffrey David Ullman
Affiliation:
Stanford University, California

Summary

One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as Google. While Google was not the first search engine, it was the first able to defeat the spammers who had made search almost useless. Moreover, the innovation provided by Google was a nontrivial technological advance, called “PageRank.” We shall begin the chapter by explaining what PageRank is and how it is computed efficiently.

Yet the war between those who want to make the Web useful and those who would exploit it for their own purposes is never over. When PageRank was established as an essential technique for a search engine, spammers invented ways to manipulate the PageRank of a Web page, a practice often called link spam. That development led to the response of TrustRank and other techniques for preventing spammers from attacking PageRank. We shall discuss TrustRank and other approaches to detecting link spam.

Finally, this chapter also covers some variations on PageRank. These techniques include topic-sensitive PageRank (which can also be adapted for combating link spam) and the HITS, or “hubs and authorities,” approach to evaluating pages on the Web.

5.1. PageRank

We begin with a portion of the history of search engines, in order to motivate the definition of PageRank, a tool for evaluating the importance of Web pages in a way that is not easy to fool. We introduce the idea of “random surfers” to explain why PageRank is effective. We then introduce the technique of “taxation,” or recycling of random surfers, in order to avoid certain Web structures that present problems for the simple version of PageRank.
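To make the random-surfer and taxation ideas concrete, here is a minimal sketch rather than the chapter's own algorithm. It assumes a tiny hypothetical link graph held in a Python dictionary, a taxation parameter `beta` of 0.85 (a commonly used value, not one stated above), and a fixed number of iterations. At each step a surfer follows a random out-link with probability `beta`, and with probability `1 - beta` is "recycled" to a page chosen uniformly at random.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Estimate PageRank by repeatedly redistributing random surfers.

    links: dict mapping each page to the list of pages it links to.
           Assumption: every link target also appears as a key.
    beta:  probability that a surfer follows a link (assumed value);
           the remaining 1 - beta is the "tax" spread over all pages.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start surfers uniformly

    for _ in range(iterations):
        # Every page receives an equal share of the recycled surfers.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                continue  # dead end: its surfers are lost in this simple sketch
            share = beta * rank[p] / len(out)
            for q in out:
                new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web with no dead ends.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

In this toy graph page A ends up with the largest share because it receives all of C's surfers and half of B's. The recycled fraction keeps every page's estimate strictly positive, which is the kind of safeguard taxation is meant to provide.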

5.1.1. Early Search Engines and Term Spam

There were many search engines before Google. Largely, they worked by crawling the Web and listing the terms (words or other strings of characters other than white space) found in each page, in an inverted index.
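As a rough illustration of an inverted index, the sketch below maps each term to the set of pages that contain it. The page names and texts are made up, terms are taken to be white-space-separated strings as described above, and lowercasing them is an extra assumption made only for convenience.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # split() with no argument breaks the text on runs of white space.
        for term in text.split():
            index[term.lower()].add(page_id)  # lowercasing is an assumption
    return index


# Hypothetical crawled pages.
pages = {
    "a.html": "link analysis of the Web",
    "b.html": "Web search engines",
}
index = build_inverted_index(pages)
print(sorted(index["web"]))     # pages containing the term "web"
print(sorted(index["search"]))  # pages containing the term "search"
```

Looking up a query term is then a single dictionary access, which is why this structure suits answering search queries over a large crawl.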

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2014


