Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

Published online by Cambridge University Press: 11 July 2018

Eduardo P. S. Castro, Thiago D. Maia, Marluce R. Pereira, Ahmed A. A. Esmin, Denilson A. Pereira
Affiliation:
Department of Computer Science, Universidade Federal de Lavras, PO Box 3037, Lavras 37200-000, Brazil e-mail: eduardo.petrini@outlook.com, dinizthiagobr@gmail.com, marluce@dcc.ufla.br, ahmed@dcc.ufla.br, denilsonpereira@dcc.ufla.br

Abstract

Several implementations of the Apriori algorithm for mining association rules have been proposed in the literature using the Hadoop-MapReduce framework and, more recently, Spark. However, none of these works has assessed performance in detail, for example by comparing an implementation with others across data sets with varied characteristics. In this work, we review the main algorithms proposed for Hadoop-MapReduce and compare their implementations in a single environment under several different conditions. Moreover, we adapted these implementations to Spark and compared them under the same conditions. Based on the experimental results, we present a framework for recommending the most appropriate Apriori implementation for a given problem, according to the data set characteristics and the minimum required support. The results show that the Spark implementations outperform the Hadoop-MapReduce implementations in runtime in most experiments. However, no single implementation is the best in all the evaluated situations.

Type
Review Article
Copyright
© Cambridge University Press, 2018 
