
Property-Based Testing for Spark Streaming

  • A. RIESCO and J. RODRÍGUEZ-HORTALÁ

Stream processing has reached the mainstream in recent years: a new generation of open-source distributed stream processing systems, designed to scale horizontally on commodity hardware, has brought the ability to process high-volume, high-velocity data streams to companies of all sizes. In this work, we propose combining temporal logic with property-based testing (PBT) to address the challenges of testing programs written in this programming model. We formalize our approach in a discrete-time temporal logic for finite words, extended to make properties more expressive with timeouts for temporal operators and a binding operator for letters. In particular, we focus on testing Spark Streaming programs written with the Spark API for the functional language Scala, using the PBT library ScalaCheck. To that end, we add temporal logic operators to a set of new ScalaCheck generators and properties, as part of our testing library sscheck.
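To make the idea of temporal operators with timeouts over finite words more concrete, the following is a minimal sketch in plain Scala. It is not the sscheck API and does not use Spark or ScalaCheck: the object `TimedTemporalSketch`, the `Verdict` type with its three values, and the functions `always` and `eventually` are hypothetical names introduced here for illustration. The sketch assumes that a word is a finite list of letters (for example, one batch of records per time step) and that a formula whose timeout window extends past the end of the word is left undecided.

```scala
object TimedTemporalSketch {
  // Outcome of checking a formula on a finite word.
  // Undecided is an assumption of this sketch: the word ended before
  // the timeout window could be fully observed.
  sealed trait Verdict
  case object Holds extends Verdict
  case object Fails extends Verdict
  case object Undecided extends Verdict

  // "always p within t": p must hold on each of the first t letters.
  def always[A](t: Int)(p: A => Boolean)(word: List[A]): Verdict = {
    val window = word.take(t)
    if (window.exists(a => !p(a))) Fails       // a counterexample was observed
    else if (window.length < t) Undecided      // word too short to conclude
    else Holds
  }

  // "eventually p within t": p must hold on at least one of the first t letters.
  def eventually[A](t: Int)(p: A => Boolean)(word: List[A]): Verdict = {
    val window = word.take(t)
    if (window.exists(p)) Holds                // a witness was observed
    else if (window.length < t) Undecided      // word too short to conclude
    else Fails
  }
}
```

For instance, with the word `List(List(1, 2), List(3), List.empty[Int], List(4, 5))`, the check `always(3)((b: List[Int]) => b.forall(_ > 0))` looks only at the first three batches, while `eventually(5)((b: List[Int]) => b.contains(9))` cannot be decided because the word has only four letters. In a property-based setting, such words would be produced by random generators rather than written by hand.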

Theory and Practice of Logic Programming
  • ISSN: 1471-0684
  • EISSN: 1475-3081
  • URL: /core/journals/theory-and-practice-of-logic-programming
Supplementary materials

Riesco and Rodríguez-Hortalá supplementary material 1 (PDF, 289 KB)
