Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

  • Eduardo P. S. Castro (a1), Thiago D. Maia (a1), Marluce R. Pereira (a1), Ahmed A. A. Esmin (a1) and Denilson A. Pereira (a1)...

Abstract

Several implementations of the Apriori algorithm for mining association rules have been proposed in the literature using the Hadoop-MapReduce framework and, more recently, Spark. However, none of these works has assessed their performance in detail, for example, by comparing them with other implementations across data sets with varying characteristics. In this work, we review the main algorithms proposed for Hadoop-MapReduce and compare their implementations in a single environment under several different situations. Moreover, we adapt these implementations to Spark and compare them under the same circumstances. Based on the results of the experiments, we present a framework for recommending the Apriori implementation most appropriate for solving a given problem, according to the data set characteristics and the minimum required support. The results show that the Spark implementations outperform the Hadoop-MapReduce implementations in runtime in most experiments. However, no single implementation is the best in all the evaluated situations.
