
10 - Stream analytics: data pre-processing and transformation

from Part IV - Application design and analytics

Published online by Cambridge University Press: 05 March 2014

Henrique C. M. Andrade (J. P. Morgan)
Buğra Gedik (Bilkent University, Ankara)
Deepak S. Turaga (IBM Thomas J. Watson Research Center, New York)

Summary

Overview

Continuous operation and data analysis are two of the distinguishing features of the stream processing paradigm. Arguably, the way in which SPAs employ analytics is what makes them invaluable to many businesses and scientific organizations.

In earlier chapters, we discussed how to architect and build a SPA to perform its analytical task efficiently. Yet we have not addressed the algorithmic issues surrounding the implementation of the analytical task itself.

Now that the stage is set and we have the knowledge and tools for engineering a high-performance SPA, we switch the focus to studying how existing stream processing and mining algorithms work, and how new ones can be designed. In the next two chapters, we will examine techniques drawn from data mining, machine learning, statistics, and other fields and show how they have been adapted and evolved to perform in the context of stream processing.
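As a small taste of this kind of adaptation, consider the sample mean and variance: computed offline, they require a full pass over the stored data, but an incremental update rule (Welford's method) maintains both with constant state, one tuple at a time. The sketch below is illustrative only and is not drawn from the chapter:

```python
class RunningStats:
    """Numerically stable online mean and variance (Welford's method),
    a classic example of adapting a batch statistic to a stream."""

    def __init__(self):
        self.n = 0        # number of tuples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Incorporate one new stream tuple in O(1) time and space.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Sample variance; defined once at least two tuples have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

Unlike the two-pass textbook formula, this version never revisits old tuples, which is precisely the constraint continuous operation imposes.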

This chapter is organized as follows. The following two sections provide a conceptual introduction to the mining process in terms of its five broad steps: data acquisition, pre-processing, transformation, modeling, and evaluation (Section 10.2), followed by a description of the mathematical notation used when discussing specific algorithms (Section 10.3).

Since many issues associated with data acquisition were discussed in Section 9.2.1, in this chapter, we focus on the second and third steps, specifically on the techniques for data pre-processing and transformation.
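One representative pre-processing technique in this family is reservoir sampling, which maintains a uniform random sample of fixed size over a stream of unknown length in a single pass. The sketch below (Vitter's Algorithm R, written here for illustration rather than taken from the book) shows the idea:

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable of
    unknown length, using a single pass and O(k) memory (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k tuples.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), evicting
            # a uniformly chosen current occupant when it is kept.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), 5, random.Random(42))
```

Because the stream's length need not be known in advance, this primitive fits naturally into a continuously running SPA, where the sample can feed downstream transformation and modeling stages.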

Type: Chapter
In: Fundamentals of Stream Processing: Application Design, Systems, and Analytics, pp. 342–387
Publisher: Cambridge University Press
Print publication year: 2014


