Recent Contributions of Data Mining to Language Learning Research

  • Mark Warschauer (a1), Soobin Yim (a2), Hansol Lee (a3) and Binbin Zheng (a4)


This paper reviews the role of data mining in research on second language (L2) learning. Following a general introduction to the topic, three areas of data mining research are summarized (clustering techniques, text mining, and social network analysis), with examples from both the broader field and studies conducted by the authors. The application of data mining to second language learning research is relatively new, and more theoretical and empirical support is needed for the appropriate collection, use, and interpretation of data for specific research and pedagogical objectives. The three examples that we introduce illustrate how new data sources accessible in online environments can be analyzed to better understand the optimal instructional context for corpus-based vocabulary learning (clustering techniques), the characteristics and patterns of collaborative written interaction in Google Docs (text mining and visualization), and issues of access and community in computer-mediated discussion (social network analysis). Implications of these new techniques for L2 research are discussed.

