1 - Data Mining

Published online by Cambridge University Press:  05 December 2014

Jure Leskovec
Affiliation:
Stanford University, California
Anand Rajaraman
Affiliation:
Milliways Laboratories, California
Jeffrey David Ullman
Affiliation:
Stanford University, California

Summary

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but help in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, the behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Example 1.1 Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. □
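
The idea in Example 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chapter does not say which formula the statistician uses, so the sketch assumes the standard maximum-likelihood estimates (the sample mean, and the standard deviation computed by dividing by n rather than n - 1). The function name `fit_gaussian` and the sample values are hypothetical and do not come from the text.

```python
import math


def fit_gaussian(data):
    """Return the maximum-likelihood (mean, std_dev) of a Gaussian for `data`.

    Assumption: "most likely parameters" is read as the maximum-likelihood
    estimate, which divides the squared deviations by n, not n - 1.
    """
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)


# Hypothetical data set: a small list of numbers.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(sample)
print(mu, sigma)  # prints: 5.0 2.0
```

Under these assumptions the two numbers `mu` and `sigma` are the whole model: the eight data points are summarized by the Gaussian they are presumed to be drawn from.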

Publisher: Cambridge University Press
Print publication year: 2014
