
10 - Stream analytics: data pre-processing and transformation

from Part IV - Application design and analytics

Published online by Cambridge University Press: 05 March 2014

Henrique C. M. Andrade (J. P. Morgan)
Buğra Gedik (Bilkent University, Ankara)
Deepak S. Turaga (IBM Thomas J. Watson Research Center, New York)

Summary

Overview

Continuous operation and data analysis are two of the distinguishing features of the stream processing paradigm. Arguably, the way in which SPAs employ analytics is what makes them invaluable to many businesses and scientific organizations.

In earlier chapters, we discussed how to architect and build a SPA to perform its analytical task efficiently. Yet we have not addressed the algorithmic issues surrounding the implementation of the analytical task itself.

Now that the stage is set and we have the knowledge and tools for engineering a high-performance SPA, we switch the focus to studying how existing stream processing and mining algorithms work, and how new ones can be designed. In the next two chapters, we will examine techniques drawn from data mining, machine learning, statistics, and other fields and show how they have been adapted and evolved to perform in the context of stream processing.
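As a small taste of this kind of adaptation, consider the sample mean and variance: computed offline, they require a full pass over the stored data, but an incremental update rule (Welford's method) maintains both with constant state, one tuple at a time. The sketch below is illustrative only and is not drawn from the chapter:

```python
class RunningStats:
    """Numerically stable online mean and variance (Welford's method),
    a classic example of adapting a batch statistic to a stream."""

    def __init__(self):
        self.n = 0        # number of tuples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Incorporate one new stream tuple in O(1) time and space.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Sample variance; defined once at least two tuples have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

Unlike the two-pass textbook formula, this version never revisits old tuples, which is precisely the constraint continuous operation imposes.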

This chapter is organized as follows. The following two sections provide a conceptual introduction to the mining process in terms of its five broad steps: data acquisition, pre-processing, transformation, modeling, and evaluation (Section 10.2), followed by a description of the mathematical notation used when discussing specific algorithms (Section 10.3).

Since many issues associated with data acquisition were discussed in Section 9.2.1, in this chapter, we focus on the second and third steps, specifically on the techniques for data pre-processing and transformation.
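One representative pre-processing technique in this family is reservoir sampling, which maintains a uniform random sample of fixed size over a stream of unknown length in a single pass. The sketch below (Vitter's Algorithm R, written here for illustration rather than taken from the book) shows the idea:

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable of
    unknown length, using a single pass and O(k) memory (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k tuples.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), evicting
            # a uniformly chosen current occupant when it is kept.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), 5, random.Random(42))
```

Because the stream's length need not be known in advance, this primitive fits naturally into a continuously running SPA, where the sample can feed downstream transformation and modeling stages.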

Type: Chapter
In: Fundamentals of Stream Processing: Application Design, Systems, and Analytics, pp. 342–387
Publisher: Cambridge University Press
Print publication year: 2014


