
11 - Stream analytics: modeling and evaluation

from Part IV - Application design and analytics

Published online by Cambridge University Press:  05 March 2014

Henrique C. M. Andrade, J. P. Morgan
Buğra Gedik, Bilkent University, Ankara
Deepak S. Turaga, IBM Thomas J. Watson Research Center, New York

Summary

Overview

In this chapter we focus on the last two stages of the data mining process and examine techniques for modeling and evaluation. In many ways, these steps form the core of a mining task, where automatic or semi-automatic analysis of streaming data is used to extract insights and actionable models. This process employs algorithms designed for different purposes, such as identifying similar groups of data, spotting unusual groups of data, or finding related data items whose associations were previously unknown.

This chapter starts with a description of a methodology combining offline modeling, where the model for a dataset is initially learned, and online evaluation, where this model is used to analyze the new data being processed by an application (Section 11.2). Although this methodology relies on algorithms where a model is learned offline from previously stored training data, it is frequently used in SPAs because it can leverage many of the existing data mining algorithms devised for analyzing datasets stored in databases and data warehouses.
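The split between the two phases can be sketched as follows. This is an illustrative example, not the book's code: the nearest-centroid classifier, the function names, and the data are all hypothetical, standing in for whatever model an SPA might learn offline and then apply to live tuples.

```python
# Hypothetical sketch of offline modeling + online evaluation.
# A nearest-centroid classifier is learned from stored training data,
# then the frozen model scores tuples as they arrive on the stream.
from math import dist

def train_offline(training_data):
    """Offline phase: compute one centroid per class label from a
    previously stored dataset."""
    sums, counts = {}, {}
    for features, label in training_data:
        s = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            s[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in s)
            for label, s in sums.items()}

def score_online(model, tuple_stream):
    """Online phase: evaluate the frozen model on each incoming tuple,
    emitting the label of the nearest centroid."""
    for features in tuple_stream:
        yield min(model, key=lambda lbl: dist(features, model[lbl]))

# Learn from archived data, then apply the model to the live stream.
model = train_offline([((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
                       ((5.0, 5.0), "b"), ((4.8, 5.2), "b")])
print(list(score_online(model, [(0.1, 0.0), (5.1, 4.9)])))  # ['a', 'b']
```

Note that the model is immutable once deployed: adapting it to new data requires re-running the offline phase, which motivates the online techniques discussed next.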

Offline modeling is often sufficient for the analytical goals of many SPAs. Nevertheless, the use of online modeling and evaluation techniques allows a SPA to function autonomically and to evolve as a result of changes in the workload and in the data. Needless to say, this is the goal envisioned by proponents of stream processing. Thus, in the rest of the chapter, we examine in detail online techniques for modeling and mining data streams.
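As a minimal illustration of the online style (an assumption of ours, not an algorithm from this chapter), Welford's one-pass method maintains a running mean and variance that adapt continuously as tuples arrive, with no stored training set and no separate learning phase:

```python
# Illustrative sketch: Welford's online algorithm updates a running
# mean and variance incrementally, one observation at a time, so the
# model evolves with the stream instead of being learned offline.
class OnlineStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        """Fold one new observation into the running estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(round(stats.mean, 2), round(stats.variance, 2))  # 5.0 4.57
```

The same incremental-update pattern underlies the stream clustering, classification, and anomaly detection techniques examined in the remainder of the chapter.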

Type: Chapter
Information: Fundamentals of Stream Processing: Application Design, Systems, and Analytics, pp. 388–438
Publisher: Cambridge University Press
Print publication year: 2014


