
11 - Stream analytics: modeling and evaluation

from Part IV - Application design and analytics

Published online by Cambridge University Press:  05 March 2014

Henrique C. M. Andrade, J. P. Morgan
Buğra Gedik, Bilkent University, Ankara
Deepak S. Turaga, IBM Thomas J. Watson Research Center, New York

Summary

Overview

In this chapter we focus on the last two stages of the data mining process and examine techniques for modeling and evaluation. In many ways, these steps form the core of a mining task, where automatic or semi-automatic analysis of streaming data is used to extract insights and actionable models. This process employs algorithms designed for different purposes, such as identifying similar groups of data, spotting unusual groups of data, or finding related data items whose associations were previously unknown.

This chapter starts with a description of a methodology combining offline modeling, where the model for a dataset is initially learned, and online evaluation, where this model is used to analyze the new data being processed by an application (Section 11.2). Although this methodology relies on algorithms where a model is learned offline from previously stored training data, it is frequently used in SPAs because it can leverage many of the existing data mining algorithms devised for analyzing datasets stored in databases and data warehouses.
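The split between the two phases can be sketched as follows. This is an illustrative example, not the book's code: the nearest-centroid classifier, the function names, and the data are all hypothetical, standing in for whatever model an SPA might learn offline and then apply to live tuples.

```python
# Hypothetical sketch of offline modeling + online evaluation.
# A nearest-centroid classifier is learned from stored training data,
# then the frozen model scores tuples as they arrive on the stream.
from math import dist

def train_offline(training_data):
    """Offline phase: compute one centroid per class label from a
    previously stored dataset."""
    sums, counts = {}, {}
    for features, label in training_data:
        s = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            s[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in s)
            for label, s in sums.items()}

def score_online(model, tuple_stream):
    """Online phase: evaluate the frozen model on each incoming tuple,
    emitting the label of the nearest centroid."""
    for features in tuple_stream:
        yield min(model, key=lambda lbl: dist(features, model[lbl]))

# Learn from archived data, then apply the model to the live stream.
model = train_offline([((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
                       ((5.0, 5.0), "b"), ((4.8, 5.2), "b")])
print(list(score_online(model, [(0.1, 0.0), (5.1, 4.9)])))  # ['a', 'b']
```

Note that the model is immutable once deployed: adapting it to new data requires re-running the offline phase, which motivates the online techniques discussed next.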

Offline modeling is often sufficient for the analytical goals of many SPAs. Nevertheless, the use of online modeling and evaluation techniques allows a SPA to function autonomically and to evolve as a result of changes in the workload and in the data. Needless to say, this is the goal envisioned by proponents of stream processing. Thus, in the rest of the chapter, we examine in detail online techniques for modeling and mining data streams.
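As a minimal illustration of the online style (an assumption of ours, not an algorithm from this chapter), Welford's one-pass method maintains a running mean and variance that adapt continuously as tuples arrive, with no stored training set and no separate learning phase:

```python
# Illustrative sketch: Welford's online algorithm updates a running
# mean and variance incrementally, one observation at a time, so the
# model evolves with the stream instead of being learned offline.
class OnlineStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        """Fold one new observation into the running estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(round(stats.mean, 2), round(stats.variance, 2))  # 5.0 4.57
```

The same incremental-update pattern underlies the stream clustering, classification, and anomaly detection techniques examined in the remainder of the chapter.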

Type: Chapter
Information: Fundamentals of Stream Processing: Application Design, Systems, and Analytics, pp. 388–438
Publisher: Cambridge University Press
Print publication year: 2014


