Comparison of text preprocessing methods

Published online by Cambridge University Press: 13 June 2022

Christine P. Chai*
Affiliation:
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA

Abstract

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
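As a rough illustration of how several of the methods listed above can be chained, the following minimal Python sketch removes formatting, normalizes case, tokenizes while dropping punctuation, removes stopwords, and forms bigrams. It is not taken from the article: the function name `preprocess`, the tiny `STOPWORDS` set, and the regular expressions are assumptions chosen only for demonstration, and stemming, lemmatization, and multiword expression identification are left out because they usually rely on dedicated libraries.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}


def preprocess(text, n=2):
    """Return (tokens, n-grams) for a raw text string."""
    # Removing formatting: replace HTML-like tags with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Text normalization: lowercase everything.
    text = text.lower()
    # Tokenization and punctuation handling: keep runs of
    # letters, digits, and apostrophes; everything else is dropped.
    tokens = re.findall(r"[a-z0-9']+", text)
    # Removing stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # n-gramming: join each run of n adjacent tokens with an underscore.
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams


tokens, bigrams = preprocess("<p>The quality of the corpus is essential.</p>")
print(tokens)   # ['quality', 'corpus', 'essential']
print(bigrams)  # ['quality_corpus', 'corpus_essential']
```

Each step here is a deliberate simplification. For example, lowercasing discards capitalization that named entity recognition may need, and a fixed stopword list may remove words that matter in a specific domain, which is the kind of trade-off the article weighs.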

Type
Survey Paper
Copyright
© The Author(s), 2022. Published by Cambridge University Press

References

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M. and Shah, Z. (2020). Top concerns of Tweeters during the COVID-19 pandemic: Infoveillance study. Journal of Medical Internet Research (JMIR) 22(4), e19016.CrossRefGoogle ScholarPubMed
Abdelali, A., Darwish, K., Durrani, N. and Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 1116.CrossRefGoogle Scholar
Abdolahi, M. and Zahedh, M. (2017). Sentence matrix normalization using most likely n-grams vector. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI). IEEE, pp. 00400045.CrossRefGoogle Scholar
Abraham, A., Dutta, P., Mandal, J.K., Bhattacharya, A. and Dutta, S. (2018). Emerging Technologies in Data Mining and Information Security: Proceedings of IEMIS 2018, Volume 2, vol. 813. Springer. IEMIS Stands for International Conference on Emerging Technologies in Data Mining and Information Security.Google Scholar
Acree, B.D. (2016). Deep Learning and Ideological Rhetoric. PhD Dissertation, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.Google Scholar
Ács, J., Kádár, Á. and Kornai, A. (2021). Subword pooling makes a difference. arXiv preprint arXiv:2102.10864.Google Scholar
Adnan, K. and Akbar, R. (2019a). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data 6(1), 138.CrossRefGoogle Scholar
Adnan, K. and Akbar, R. (2019b). Limitations of information extraction methods and techniques for heterogeneous unstructured big data. International Journal of Engineering Business Management 11, 123.CrossRefGoogle Scholar
Aggarwal, C.C. and Zhai, C. (2012). Mining Text Data. Boston, MA, USA: Springer Science & Business Media.CrossRefGoogle Scholar
Agić, Ž., Merkler, D. and Berović, D. (2013). Parsing Croatian and Serbian by using Croatian dependency treebanks. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 2233.Google Scholar
Agre, G., van Genabith, J. and Declerck, T. (2018). Artificial Intelligence: Methodology, Systems, and Applications: 18th International Conference, AIMSA 2018, Varna, Bulgaria, 12–14 September 2018, Proceedings, vol. 11089. Springer.CrossRefGoogle Scholar
Akhtar, A.K., Sahoo, G. and Kumar, M. (2017). Digital corpus of Santali language. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, pp. 934938.CrossRefGoogle Scholar
Akkasi, A., Varoğlu, E. and Dimililer, N. (2016). ChemTok: A new rule based tokenizer for chemical named entity recognition. BioMed Research International 2016, 19.CrossRefGoogle ScholarPubMed
Al-Khafaji, H.K. and Habeeb, A.T. (2017). Efficient algorithms for preprocessing and stemming of tweets in a sentiment analysis system. International Organization of Scientific Research – Journal of Computer Engineering 9(3), 4450.Google Scholar
Al-Molegi, A., Alsmadi, I., Najadat, H. and Albashiri, H. (2015). Automatic learning of Arabic text categorization. International Journal of Digital Contents and Applications 2(1), 116.Google Scholar
Al Sharou, K., Li, Z. and Specia, L. (2021). Towards a better understanding of noise in natural language processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pp. 5362.CrossRefGoogle Scholar
Albalawi, R., Yeap, T.H. and Benyoucef, M. (2020). Using topic modeling methods for short-text data: A comparative analysis. Frontiers in Artificial Intelligence 3, 42.CrossRefGoogle ScholarPubMed
Albishre, K., Albathan, M. and Li, Y. (2015). Effective 20 newsgroups dataset cleaning. In 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3. IEEE, pp. 98101.CrossRefGoogle Scholar
Aliwy, A.H. (2012). Tokenization as preprocessing for Arabic tagging system. International Journal of Information and Education Technology 2(4), 348353.CrossRefGoogle Scholar
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B. and Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.Google Scholar
Alonso, M.A., Gómez-Rodríguez, C. and Vilares, J. (2021). On the use of parsing for named entity recognition. Applied Sciences 11(3), 1090.CrossRefGoogle Scholar
Amarasinghe, K., Manic, M. and Hruska, R. (2015). Optimal stop word selection for text mining in critical infrastructure domain. In 2015 Resilience Week (RWS). IEEE, pp. 16.CrossRefGoogle Scholar
Anandarajan, M., Hill, C. and Nolan, T. (2019). Text preprocessing. In Practical Text Analytics. Springer, pp. 4559.CrossRefGoogle Scholar
Anderwald, L. (2003). Negation in Non-Standard British English: Gaps, Regularizations and Asymmetries. London, UK: Routledge.CrossRefGoogle Scholar
Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F. and Manicardi, S. (2016). A comparison between preprocessing techniques for sentiment analysis in Twitter. In Proceedings of the 2nd International Workshop on Knowledge Discovery on the WEB (KDWeb).Google Scholar
Arens, R. (2004). A preliminary look into the use of named entity information for bioscience text tokenization. In Proceedings of the Student Research Workshop at HLT-NAACL 2004, pp. 3742. HLT-NAACL stands for Human Language Technologies – North American Chapter of the Association for Computational Linguistics.CrossRefGoogle Scholar
Arief, M. and Deris, M.B.M. (2021). Text preprocessing impact for sentiment classification in product review. In 2021 Sixth International Conference on Informatics and Computing (ICIC). IEEE, pp. 17.CrossRefGoogle Scholar
Armano, G., Fanni, F. and Giuliani, A. (2015). Stopwords identification by means of characteristic and discriminant analysis. In ICAART (2), pp. 353360. ICCART stands for International Conference on Agents and Artificial Intelligence.CrossRefGoogle Scholar
Armengol Estapé, J. (2021). A Pipeline for Large Raw Text Preprocessing and Model Training of Language Models at Scale. Master’s Thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.Google Scholar
Arora, C., Sabetzadeh, M., Briand, L. and Zimmer, F. (2016). Extracting domain models from natural-language requirements: Approach and industrial evaluation. In Proceedings of the ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems, pp. 250260.CrossRefGoogle Scholar
Arun, R., Suresh, V. and Madhavan, C.V. (2009). Stopword graphs and authorship attribution in text corpora. In 2009 IEEE International Conference on Semantic Computing. IEEE, pp. 192196.CrossRefGoogle Scholar
Ashok, A., Elmasri, R. and Natarajan, G. (2019). Comparing different word embeddings for multiword expression identification. In International Conference on Applications of Natural Language to Information Systems. Springer, pp. 295302.Google Scholar
Attia, M. (2007). Arabic tokenization system. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pp. 6572.CrossRefGoogle Scholar
Atwood, J. (2008). Stack Overflow Podcast Episode 32. Available from: https://stackoverflow.blog/2008/12/04/podcast-32/.Google Scholar
Au, T.C. (2014). Topics in Computational Advertising. PhD Dissertation, Duke University, Durham NC, USA.Google Scholar
Aye, T.T. (2011). Web log cleaning for mining of web usage patterns. In 2011 3rd International Conference on Computer Research and Development, vol. 2. IEEE, pp. 490494.CrossRefGoogle Scholar
Ayral, H. and Yavuz, S. (2011). An automated domain specific stop word generation method for natural language text classification. In 2011 International Symposium on Innovations in Intelligent Systems and Applications. IEEE, pp. 500503.CrossRefGoogle Scholar
Babar, S. and Patil, P.D. (2015). Improving performance of text summarization. Procedia Computer Science 46, 354363.CrossRefGoogle Scholar
Baerman, M. (2015). The Oxford Handbook of Inflection. Oxford, UK: Oxford University Press.CrossRefGoogle Scholar
Baldwin, R.S. and Coady, J.M. (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Reading Behavior 10(4), 363375.CrossRefGoogle Scholar
Baldwin, T. and Kim, S.N. (2010). Multiword expressions. Handbook of Natural Language Processing 2, 267292.Google Scholar
Baldwin, T. and Li, Y. (2015). An in-depth analysis of the effect of text normalization in social media. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 420429.CrossRefGoogle Scholar
Bao, Y., Quan, C., Wang, L. and Ren, F. (2014). The role of pre-processing in Twitter sentiment analysis. In International Conference on Intelligent Computing. Springer, pp. 615624.Google Scholar
Bardoel, T. (2012). Comparing n-gram frequency distributions. Technical report, Tilburg center for Cognition and Communication (TiCC), Tilburg University, Tilburg, Netherlands.Google Scholar
Barr, C., Jones, R. and Regelson, M. (2008). The linguistic structure of English web-search queries. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 10211030.CrossRefGoogle Scholar
Barrett, N. and Weber-Jahnke, J. (2011). Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm. BMC Bioinformatics 12(3), S1.CrossRefGoogle ScholarPubMed
Barrón-Cedeño, A. and Rosso, P. (2009). On automatic plagiarism detection based on n-grams comparison. In European Conference on Information Retrieval. Springer, pp. 696700.CrossRefGoogle Scholar
Barteld, F. (2017). Detecting spelling variants in non-standard texts. In Proceedings of the Student Research Workshop at the Fifteenth Conference of the European Chapter of the Association for Computational Linguistics, pp. 1122.CrossRefGoogle Scholar
Basta, C., Costa-jussà, M.R. and Casas, N. (2021). Extensive study on the underlying gender bias in contextualized word embeddings. Neural Computing and Applications 33(8), 33713384.CrossRefGoogle Scholar
Batrinca, B. and Treleaven, P.C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & Society 30(1), 89116.CrossRefGoogle Scholar
Battenberg, E. (2012). Ratings prediction using linear regression on text reviews. Technical report, Berkeley Institute of Design (BID), University of California – Berkeley, Berkeley CA, United States.Google Scholar
Beaufays, F. and Strope, B. (2013). Language model capitalization. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 67496752.CrossRefGoogle Scholar
Beaver, I. (2019). pycontractions 2.0.1. Python library. Available from: https://pypi.org/project/pycontractions/.Google Scholar
Beitzel, S.M., Jensen, E.C., Lewis, D.D., Chowdhury, A. and Frieder, O. (2007). Automatic classification of web queries using very large unlabeled query logs. ACM Transactions on Information Systems (TOIS) 25(2), 129.CrossRefGoogle Scholar
Bekkerman, R. and Allan, J. (2004). Using bigrams in text categorization. Technical report, IR-408, Center of Intelligent Information Retrieval, University of Massachusetts at Amherst, Amherst MA, United States.Google Scholar
Bender, E.M. (2011). On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology 6(3), 126.CrossRefGoogle Scholar
Bender, E.M. and Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, 587604.CrossRefGoogle Scholar
Bengfort, B., Bilbro, R. and Ojeda, T. (2018). Applied Text Analysis with Python: Enabling Language-Aaware Data Products with Machine Learning. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar
Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research 3, 11371155.Google Scholar
Benoit, K., Muhr, D. and Watanabe, K. (2017). stopwords: Multilingual Stopword Lists. $\mathsf{R}$ package version 0.9.0. Available from: https://CRAN.R-project.org/package=stopwords.Google Scholar
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S. and Matsuo, A. (2018). quanteda: An $\mathsf{R}$ package for the quantitative analysis of textual data. Journal of Open Source Software 3(30), 774. Available from: https://quanteda.io.CrossRefGoogle Scholar
Berberich, K. and Bedathur, S. (2013). Computing n-gram statistics in MapReduce. In Proceedings of the 16th International Conference on Extending Database Technology, pp. 101112.CrossRefGoogle Scholar
Berend, G. (2018). L1 regularization of word embeddings for multi-word expression identification. Acta Cybernetica 23(3), 801813.CrossRefGoogle Scholar
Bergmanis, T. and Goldwater, S. (2018). Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 13911400.CrossRefGoogle Scholar
Bernardy, J.-P. and Chatzikyriakidis, S. (2019). What kind of natural language inference are NLP systems learning: Is this enough? In ICAART (2), pp. 919931. ICAART stands for International Conference on Agents and Artificial Intelligence.Google Scholar
Bhasuran, B., Murugesan, G., Abdulkadhar, S. and Natarajan, J. (2016). Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. Journal of Biomedical Informatics 64, 19.CrossRefGoogle ScholarPubMed
Bi, Y. (2016). Scheduling Optimization with LDA and Greedy Algorithm. Master’s Thesis, Duke University, Durham NC, United States. LDA stands for latent Dirichlet allocation.Google Scholar
Bird, S., Loper, E. and Klein, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. USA: O’Reilly Media Inc.Google Scholar
Blanchard, A. (2007). Understanding and customizing stopword lists for enhanced patent mapping. World Patent Information 29(4), 308316.CrossRefGoogle Scholar
Blanco, E. and Moldovan, D. (2011). Some issues on detecting negation from text. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society (FLAIRS) Conference.Google Scholar
Blei, D.M. and Lafferty, J.D. (2009). Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013.Google Scholar
Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 9931022.Google Scholar
Bodapati, S., Yun, H. and Al-Onaizan, Y. (2019). Robustness to capitalization errors in named entity recognition. arXiv preprint arXiv:1911.05241.Google Scholar
Bokka, K.R., Hora, S., Jain, T. and Wambugu, M. (2019). Deep Learning for Natural Language Processing: Solve your Natural Language Processing Problems with Smart Deep Neural Networks. Birmingham, UK: Packt Publishing Ltd.Google Scholar
Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science 2(1), 18.CrossRefGoogle Scholar
Bollmann, M. (2013). POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 1118.Google Scholar
Bollmann, M. (2019). A large-scale comparison of historical text normalization systems. arXiv preprint arXiv:1904.02036.Google Scholar
Bouchet-Valat, M. (2019). SnowballC: Snowball Stemmers Based on the C ‘libstemmer’ UTF-8 Library. $\mathsf{R}$ package version 0.6.0. Available from: https://CRAN.R-project.org/package=SnowballC.Google Scholar
Boukobza, R. and Rappoport, A. (2009). Multi-word expression identification using sentence surface features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 468477.CrossRefGoogle Scholar
Brants, T., Popat, A.C., Xu, P., Och, F.J. and Dean, J. (2007). Large language models in machine translation. Technical report, Google Research.Google Scholar
Briscoe, T. (1996). The syntax and semantics of punctuation and its use in interpretation. In Proceedings of the Association for Computational Linguistics Workshop on Punctuation, pp. 17.Google Scholar
Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8–13), 11571166. ISDN stands for Integrated Services Digital Network.CrossRefGoogle Scholar
Brooke, J., Hammond, A. and Hirst, G. (2015). GutenTag: An NLP-driven tool for digital humanities research in the Project Gutenberg corpus. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pp. 4247.CrossRefGoogle Scholar
Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V. J.D. and Lai, J.C. (1992). Class-based n-gram models of natural language. Computational Linguistics 18(4), 467479.Google Scholar
Buck, C., Heafield, K. and Van Ooyen, B. (2014). N-gram counts and language models from the common crawl. In LREC (International Conference on Language Resources and Evaluation), pp. 35793584.Google Scholar
Buerki, A. (2017). Frequency consolidation among word n-grams. In International Conference on Computational and Corpus-Based Phraseology. Springer, pp. 432446.CrossRefGoogle Scholar
Cabot, C., Soualmia, L.F., Dahamna, B. and Darmoni, S.J. (2016). SIBM at CLEF eHealth Evaluation Lab 2016: Extracting concepts in French medical texts with ECMT and CIMIND. In CLEF (Working Notes), pp. 4760. CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar
Cai, H., Yang, Y., Li, X. and Huang, Z. (2015). What are popular: Exploring Twitter features for event detection, tracking and visualization. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 8998.CrossRefGoogle Scholar
Callan, J., Hoy, M., Yoo, C. and Zhao, L. (2009). The ClueWeb09 dataset. Available from: http://lemurproject.org/clueweb09/.Google Scholar
Camacho-Collados, J. and Pilehvar, M.T. (2018). On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 4046. EMNLP stands for Conference on Empirical Methods in Natural Language Processing.Google Scholar
Cambria, E., Poria, S., Gelbukh, A. and Thelwall, M. (2017). Sentiment analysis is a big suitcase. IEEE Intelligent Systems 32(6), 7480.CrossRefGoogle Scholar
Campbell, W.M., Li, L., Dagli, C., Acevedo-Aviles, J., Geyer, K., Campbell, J.P. and Priebe, C. (2016). Cross-domain entity resolution in social media. arXiv preprint arXiv:1608.01386.Google Scholar
Carvalho, G., de Matos, D.M. and Rocio, V. (2007). Document retrieval for question answering: A quantitative evaluation of text preprocessing. In Proceedings of the ACM First PhD Workshop in CIKM, pp. 125130. CIKM stands for Conference on Information and Knowledge Management.CrossRefGoogle Scholar
Chai, C.P. (2017). Statistical Issues in Quantifying Text Mining Performance. PhD Dissertation, Duke University, Durham NC, USA.Google Scholar
Chai, C.P. (2019). Text mining in survey data. Survey Practice 12(1), 113.CrossRefGoogle Scholar
Chai, C.P. (2020). The importance of data cleaning: Three visualization examples. CHANCE 33(1), 49.CrossRefGoogle Scholar
Chai, C.P. (2022). Word distinctivity – Quantifying improvement of topic modeling results from n-gramming. REVSTAT–Statistical Journal 20(2), 199–220.Google Scholar
Chang, J.P., Chiam, C., Fu, L., Wang, A.Z., Zhang, J. and Danescu-Niculescu-Mizil, C. (2020). Convokit: A toolkit for the analysis of conversations. arXiv preprint arXiv:2005.04246.Google Scholar
Chatterjee, S. and Krystyanczuk, M. (2017). Python Social Media Analytics. Birmingham, UK: Packt Publishing Ltd. Book excerpt. Available from: https://hub.packtpub.com/clean-social-media-data-analysis-python/.Google Scholar
Chaudhary, G. and Kshirsagar, M. (2018). Overview and application of text data pre-processing techniques for text mining on health news Tweets. Helix 8(5), 37643768.CrossRefGoogle Scholar
Chen, Q., Yao, L. and Yang, J. (2016). Short text classification based on LDA topic model. In 2016 International Conference on Audio, Language and Image Processing (ICALIP). IEEE, pp. 749753. LDA stands for latent Dirichlet allocation.CrossRefGoogle Scholar
Chen, Y., Zhang, H., Liu, R., Ye, Z. and Lin, J. (2019). Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowledge-Based Systems 163, 113. LDA stands for latent Dirichlet allocation, and NMF stands for non-negative matrix factorization. Google Scholar
Chomsky, N. (1969). Deep Structure, Surface Structure, and Semantic Interpretation. Indiana University Linguistics Club.Google Scholar
Chua, M., Van Esch, D., Coccaro, N., Cho, E., Bhandari, S. and Jia, L. (2018). Text normalization infrastructure that scales to hundreds of language varieties. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).Google Scholar
Cirqueira, D., Pinheiro, M.F., Jacob, A., Lobato, F. and Santana, Á. (2018). A literature review in preprocessing for sentiment analysis for Brazilian Portuguese social media. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, pp. 746749.CrossRefGoogle Scholar
Clark, E. and Araki, K. (2011). Text normalization in social media: Progress, problems and applications for a pre-processing system of casual English. Procedia-Social and Behavioral Sciences 27, 211.CrossRefGoogle Scholar
Clough, P. (2001). A Perl program for sentence splitting using rules. Technical report, University of Sheffield, Sheffield, United Kingdom.Google Scholar
Cohen, K.B., Acquaah-Mensah, G.K., Dolbey, A.E. and Hunter, L. (2002). Contrast and variability in gene names. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, vol. 3. Association for Computational Linguistics (ACL), pp. 1420.CrossRefGoogle Scholar
Cohen, K.B., Hunter, L.E. and Pressman, P.S. (2019). P-hacking lexical richness through definitions of “type” and “token”. Studies in Health Technology and Informatics 264, 14331434.Google ScholarPubMed
Cohen, K.B., Ogren, P.V., Fox, L. and Hunter, L. (2005). Empirical data on corpus design and usage in biomedical natural language processing. In American Medical Informatics Association (AMIA) Annual Symposium Proceedings, vol. 2005, p. 156.Google Scholar
Cohen, K.B., Roeder, C., Baumgartner, W. A. Jr, Hunter, L.E. and Verspoor, K. (2010). Test suite design for ontology concept recognition systems, pp. 441446.Google Scholar
Cohen, K.B., Tanabe, L., Kinoshita, S. and Hunter, L. (2004). A resource for constructing customized test suites for molecular biology entity identification systems. In HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases, pp. 1–8. HLT-NAACL stands for Human Language Technologies – North American Chapter of the Association for Computational Linguistics.Google Scholar
Congjun, L. and Hill, N.W. (2021). Recent developments in Tibetan NLP. Transactions on Asian and Low-Resource Language Information Processing 20(2), 13.CrossRefGoogle Scholar
Constant, M., Eryiğit, G., Monti, J., Van Der Plas, L., Ramisch, C., Rosner, M. and Todirascu, A. (2017). Multiword expression processing: A survey. Computational Linguistics 43(4), 837892.CrossRefGoogle Scholar
Corbett, P. and Boyle, J. (2018). Improving the learning of chemical-protein interactions from literature using transfer learning and specialized word embeddings. Database 2018, 1–10.CrossRefGoogle ScholarPubMed
Corral, Á., Boleda, G. and Ferrer-i Cancho, R. (2015). Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PLOS One 10(7), e0129031.CrossRefGoogle ScholarPubMed
Councill, I., McDonald, R. and Velikovich, L. (2010). What’s great and what’s not: Learning to classify the scope of negation for improved sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 5159.Google Scholar
Craiger, P. and Shenoi, S. (2007). Advances in Digital Forensics III: IFIP International Conference on Digital Forensics, National Center for Forensic Science, Orlando Florida, 2831 January 2007, vol. 242. Springer. IFIP stands for International Federation for Information Processing.CrossRefGoogle Scholar
Craven, T.C. (2004). Variations in use of meta tag keywords by web pages in different languages. Journal of Information Science 30(3), 268279.CrossRefGoogle Scholar
Crystal, D. (1999). The Penguin Dictionary of Language. London, UK: Penguin Group.Google Scholar
Cutrone, L. and Chang, M. (2011). Auto-assessor: Computerized assessment system for marking student’s short-answers automatically. In 2011 IEEE International Conference on Technology for Education. IEEE, pp. 8188.CrossRefGoogle Scholar
Dařena, F. (2019). VecText: Converting documents to vectors. IAENG International Journal of Computer Science 46(2). IAENG stands for International Association of Engineers. Google Scholar
Das, B., Pal, S., Mondal, S.K., Dalui, D. and Shome, S.K. (2013). Automatic keyword extraction from any text document using N-gram rigid collocation. International Journal of Soft Computing and Engineering (IJSCE) 3(2), 238242.Google Scholar
Denny, M.J. and Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis 26(2), 168189.CrossRefGoogle Scholar
Deoras, A., Mikolov, T. and Church, K. (2011). A fast re-scoring strategy to capture long-distance dependencies. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 11161127.Google Scholar
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.Google Scholar
Daz, N.P.C. and López, M.J.M. (2015). An analysis of biomedical tokenization: Problems and strategies. In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, pp. 4049.CrossRefGoogle Scholar
Dinov, I.D. (2018). Data Science and Predictive Analytics: Biomedical and Health Applications using $\mathsf{R}$ . Ann Arbor, MI, USA: Springer.CrossRefGoogle Scholar
Dodge, J., Gururangan, S., Card, D., Schwartz, R. and Smith, N.A. (2019). Show your work: Improved reporting of experimental results. arXiv preprint arXiv:1909.03004.Google Scholar
Domingo, M., Garca-Martnez, M., Helle, A., Casacuberta, F. and Herranz, M. (2018). How much does tokenization affect neural machine translation? arXiv preprint arXiv:1812.08621.Google Scholar
Doraisamy, S. and Rüger, S. (2003). Robust polyphonic music retrieval with n-grams. Journal of Intelligent Information Systems 21(1), 5370.CrossRefGoogle Scholar
Dressel, F. (2016). Distribution of n-grams in English text corpus. Available from: http://rpubs.com/fdd/187848.Google Scholar
Dror, R., Baumer, G., Shlomov, S. and Reichart, R. (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the Fifty-Sixth Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 13831392.CrossRefGoogle Scholar
Du, J., Yu, P. and Zong, C. (2018). Towards computing technologies on machine parsing of English and Chinese garden path sentences. In Proceedings of the Future Technologies Conference. Springer, pp. 806827.Google Scholar
Duan, J., Lu, R., Wu, W., Hu, Y. and Tian, Y. (2006). A bio-inspired approach for multi-word expression extraction. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics, pp. 176182. COLING stands for International Conference on Computational Linguistics.CrossRefGoogle Scholar
Dunning, T. (1994). Statistical identification of language. In Computing Research Laboratory Technical Memo MCCS 94-273. New Mexico State University.Google Scholar
Duran, M.S., Avanço, L., Alusio, S., Pardo, T. and Nunes, M.d.G.V. (2014). Some issues on the normalization of a corpus of products reviews in Portuguese. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 2228.CrossRefGoogle Scholar
Effrosynidis, D., Symeonidis, S. and Arampatzis, A. (2017). A comparison of pre-processing techniques for Twitter sentiment analysis. In International Conference on Theory and Practice of Digital Libraries. Springer, pp. 394406.CrossRefGoogle Scholar
Ek, A., Bernardy, J.-P. and Chatzikyriakidis, S. (2020). How does punctuation affect neural models in natural language inference. In Proceedings of the Probability and Meaning Conference (PaM 2020), pp. 109116.Google Scholar
El-Khair, I.A. (2017). Effects of stop words elimination for Arabic information retrieval: A comparative study. arXiv preprint arXiv:1702.01925.Google Scholar
Elming, J., Johannsen, A., Klerke, S., Lapponi, E., Alonso, H.M. and Søgaard, A. (2013). Down-stream effects of tree-to-dependency conversions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 617626.Google Scholar
Fan, A., Doshi-Velez, F. and Miratrix, L. (2017). Promoting domain-specific terms in topic models with informative priors. arXiv preprint arXiv:1701.03227.Google Scholar
Fancellu, F., Lopez, A. and Webber, B. (2016). Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 495504.CrossRefGoogle Scholar
Fancellu, F., Lopez, A., Webber, B. and He, H. (2017). Detecting negation scope is easy, except when it isn’t. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 5863.CrossRefGoogle Scholar
Farzindar, A.A. and Inkpen, D. (2020). Natural language processing for social media. Synthesis Lectures on Human Language Technologies 13(2), 1219.CrossRefGoogle Scholar
Fatyanosa, T.N. and Bachtiar, F.A. (2017). Classification method comparison on Indonesian social media sentiment analysis. In 2017 International Conference on Sustainable Information Engineering and Technology (SIET). IEEE, pp. 310315.CrossRefGoogle Scholar
Feinerer, I. and Hornik, K. (2018). tm: Text Mining Package. $\mathsf{R}$ package version 0.7.6. Available from: https://CRAN.R-project.org/package=tm.Google Scholar
Ferilli, S., Esposito, F. and Grieco, D. (2014). Automatic learning of linguistic resources for stopword removal and stemming from text. Procedia Computer Science 38, 116123.CrossRefGoogle Scholar
Ferret, O., Grau, B., Hurault-Plantet, M., Illouz, G., Jacquemin, C., Monceaux, L. and Vilnat, A. (2002). How NLP can improve question answering. Knowledge Organization 29(3–4), 135155.Google Scholar
Fesseha, A., Xiong, S., Emiru, E.D., Diallo, M. and Dahou, A. (2021). Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Information 12(2), 52.CrossRefGoogle Scholar
Finlayson, M. and Kulkarni, N. (2011). Detecting multi-word expressions improves word sense disambiguation. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, pp. 2024.Google Scholar
Fisher, N. (2013). Analytics for Leaders: A Performance Measurement System for Business Success. Cambridge, UK: Cambridge University Press.CrossRefGoogle Scholar
Fisher, N. (2019). A comprehensive approach to problems of performance measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society) 182(3), 755803.CrossRefGoogle Scholar
Fisher, N. and Lee, A. (2011). Getting the ‘correct’ answer from survey responses: A simple application of the EM algorithm. Australian & New Zealand Journal of Statistics 53(3), 353364.CrossRefGoogle Scholar
Forst, M. and Kaplan, R.M. (2006). The importance of precise tokenizing for deep grammars. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06).Google Scholar
Fosler-Lussier, E. (1998). Markov models and hidden Markov models: A brief tutorial. International Computer Science Institute.Google Scholar
Foster, J., Cetinoglu, O., Wagner, J., Le Roux, J., Hogan, S., Nivre, J., Hogan, D., and Van Genabith, J. 2011. # hardtoparse: POS tagging and parsing the Twitterverse. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence. AAAI stands for Association for the Advancement of Artificial Intelligence.Google Scholar
Fowler, F. J. 1995. Improving Survey Questions: Design and Evaluation, vol. 38. Thousand Oaks, CA, USA: Sage.Google Scholar
Fox, C. 1989. A stop list for general text. In ACM SIGIR Forum, vol. 24, pp. 1921. ACM. SIGIR stands for Special Interest Group on Information Retrieval.CrossRefGoogle Scholar
Friedman, H.H. and Amoo, T. (1999). Rating the rating scales. Journal of Marketing Management 9(3), 114123.Google Scholar
Frigau, L., Wu, Q. and Banks, D. (2021). Optimizing the JSM program. Journal of the American Statistical Association 00(0), 110. JSM stands for Joint Statistical Meetings.Google Scholar
Fundel, K., Küffner, R. and Zimmer, R. (2007). RelEx – relation extraction using dependency parse trees. Bioinformatics 23(3), 365371.CrossRefGoogle ScholarPubMed
Gabernet, A.R. and Limburn, J. (2017). Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity. Available from: https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity.Google Scholar
Gamallo, P., Campos, J.R.P. and Alegria, I. (2017). A perplexity-based method for similar languages discrimination. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 109114.CrossRefGoogle Scholar
Garíca, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M. and Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics 1(1), 122.Google Scholar
Gentzkow, M., Kelly, B.T. and Taddy, M. (2017). Text as data. Technical report, National Bureau of Economic Research, Cambridge, MA, USA.CrossRefGoogle Scholar
Gerlach, M., Shi, H. and Amaral, L.A.N. (2019). A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence 1(12), 606612.CrossRefGoogle Scholar
Gerz, D., Vulić, I., Ponti, E., Naradowsky, J., Reichart, R. and Korhonen, A. (2018). Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics 6, 451465.CrossRefGoogle Scholar
Ghosh, S., Johansson, R., Riccardi, G. and Tonelli, S. (2011). Shallow discourse parsing with conditional random fields. In Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 10711079.Google Scholar
Globerson, A. and Roweis, S. (2006). Nightmare at test time: Robust learning by feature deletion. In Proceedings of the Twenty-Third International Conference on Machine Learning. ACM, pp. 353360.CrossRefGoogle Scholar
Goldberg, Y. and Orwant, J. (2013). A dataset of syntactic-ngrams over time from a very large corpus of English books. Technical report, Google Research.Google Scholar
Gomes, F.B., Adán-Coello, J.M. and Kintschner, F.E. (2018). Studying the effects of text preprocessing and ensemble methods on sentiment analysis of Brazilian Portuguese Tweets. In International Conference on Statistical Language and Speech Processing. Springer, pp. 167177.CrossRefGoogle Scholar
Goodman, E.L., Zimmerman, C. and Hudson, C. (2020). Packet2vec: Utilizing word2vec for feature extraction in packet data. arXiv preprint arXiv:2004.14477.Google Scholar
Gorrell, G., Petrak, J. and Bontcheva, K. (2015). Using @Twitter conventions to improve #LOD-based named entity disambiguation. In European Semantic Web Conference. Springer, pp. 171186. LOD stands for Linked Open Data.CrossRefGoogle Scholar
Goyvaerts, J. and Levithan, S. (2012). Regular Expressions Cookbook. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar
Grabar, N., Zweigenbaum, P., Soualmia, L. and Darmoni, S. (2003). Matching controlled vocabulary words. In Studies in Health Technology and Informatics, pp. 445450.Google Scholar
Grana, J., Alonso, M.A. and Vilares, M. (2002). A common solution for tokenization and part-of-speech tagging. In International Conference on Text, Speech and Dialogue, vol. 2448. Springer, pp. 310.Google Scholar
Grefenstette, G. and Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenisation. Technical report, Rank Xerox Research Centre, Grenoble Laboratory, Meylan, France.Google Scholar
Gries, S.T. and Mukherjee, J. (2010). Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics 15(4), 520548. ICE stands for International Corpus of English.CrossRefGoogle Scholar
Grishman, R. (2015). Information extraction. IEEE Intelligent Systems 30(5), 815.CrossRefGoogle Scholar
Grobelnik, M. and Mladenic, D. (2004). Text-mining tutorial. Technical report, Jožef Stefan Institute (JSI), Slovenia.Google Scholar
Grön, L. and Bertels, A. (2018). Clinical sublanguages: Vocabulary structure and its impact on term weighting. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication 24(1), 4165.Google Scholar
Groza, T. and Verspoor, K. (2014). Automated generation of test suites for error analysis of concept recognition systems. In Proceedings of the Australasian Language Technology Association Workshop 2014, pp. 2331.Google Scholar
Guthrie, D., Allison, B., Liu, W., Guthrie, L. and Wilks, Y. (2006). A closer look at skip-gram modelling. In LREC (International Conference on Language Resources and Evaluation), vol. 6, pp. 12221225.Google Scholar
Ha, L., Hanna, P., Ming, J. and Smith, F. (2009). Extending Zipf’s law to n-grams for large corpora. Artificial Intelligence Review 32(1–4), 101113.CrossRefGoogle Scholar
Habert, B., Adda, G., Adda-Decker, M., de Marëuil, P.B., Ferrari, S., Ferret, O., Illouz, G. and Paroubek, P. (1998). Towards tokenization evaluation. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pp. 427431.Google Scholar
HaCohen-Kerner, Y., Miller, D. and Yigal, Y. (2020). The influence of preprocessing on text classification using a bag-of-words representation. PLOS One 15(5), e0232525.CrossRefGoogle ScholarPubMed
Haddi, E., Liu, X. and Shi, Y. (2013). The role of text pre-processing in sentiment analysis. Procedia Computer Science 17, 2632.CrossRefGoogle Scholar
Hajba, G.L. (2018). Using beautiful soup. In Website Scraping with Python. Springer, pp. 4196.CrossRefGoogle Scholar
Hansen, C., Hansen, C., Simonsen, J.G. and Lioma, C. (2018). The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab. In CLEF (Working Notes). CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar
Hartrumpf, S., Helbig, H. and Osswald, R. (2006). Semantic interpretation of prepositions for NLP applications. In Proceedings of the Third ACL-SIGSEM Workshop on Prepositions.CrossRefGoogle Scholar
He, W., Zha, S. and Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management 33(3), 464472.CrossRefGoogle Scholar
Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. and Jennions, M.D. (2015). The extent and consequences of p-hacking in science. PLOS Biology 13(3), e1002106.CrossRefGoogle ScholarPubMed
Hedderich, M.A., Lange, L., Adel, H., Strötgen, J. and Klakow, D. (2020). A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint arXiv:2010.12309.Google Scholar
Heimerl, F., Lohmann, S., Lange, S. and Ertl, T. (2014). Word cloud explorer: Text analytics based on word clouds. In 2014 Forty-Seventh Hawaii International Conference on System Sciences. IEEE, pp. 18331842.CrossRefGoogle Scholar
Henry, T. (2016). Quick ngram processing script. Available from: https://github.com/trh3/NGramProcessing.Google Scholar
Hernández, M.A. and Stolfo, S.J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 937.CrossRefGoogle Scholar
Hickman, L., Thapa, S., Tay, L., Cao, M. and Srinivasan, P. (2020). Text preprocessing for text mining in organizational research: Review and recommendations. In Organizational Research Methods, pp. 158.Google Scholar
Hiraoka, T., Shindo, H. and Matsumoto, Y. (2019). Stochastic tokenization with a language model for neural text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 16201629.CrossRefGoogle Scholar
Hoffman, B. (1996). Translating into free word order languages. In Proceedings of the 16th Conference on Computational Linguistics-Volume 1, pp. 556561.CrossRefGoogle Scholar
Hofherr, P.C. (2012). Preposition-determiner amalgams in German and French at the syntax-morphology interface. In Ackema, P. (ed), Comparative Germanic Syntax: The State of the Art, Amsterdam, Netherlands. John Benjamins Publishing Company, pp. 99132.CrossRefGoogle Scholar
Houvardas, J. and Stamatatos, E. (2006). N-gram feature selection for authorship identification. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications. Springer, pp. 7786.CrossRefGoogle Scholar
Huang, H. and Zhang, B. (2009). Text segmentation. In Liu, L. and Özsu, M.T. (eds), Encyclopedia of Database Systems, vol. 6. Springer, pp. 30723075.CrossRefGoogle Scholar
Huston, S., Culpepper, J.S. and Croft, W.B. (2014). Indexing word sequences for ranked retrieval. ACM Transactions on Information Systems (TOIS) 32(1), 126.CrossRefGoogle Scholar
Huston, S., Moffat, A. and Croft, W.B. (2011). Efficient indexing of repeated n-grams. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 127136.CrossRefGoogle Scholar
Hutto, C. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 8, pp. 216225. AAAI stands for Association for the Advancement of Artificial Intelligence.CrossRefGoogle Scholar
Hvitfeldt, E. and Silge, J. (2021). Supervised Machine Learning for Text Analysis in $\mathsf{R}$ . New York, NY, USA: Chapman and Hall/CRC.CrossRefGoogle Scholar
Indurkhya, N. and Damerau, F.J. (2010). Handbook of Natural Language Processing. New York, NY, USA: Chapman and Hall/CRC.CrossRefGoogle Scholar
Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D. and Rayes, A. (2015). A survey on text mining in social networks. The Knowledge Engineering Review 30(2), 157170.CrossRefGoogle Scholar
Jean-Baptiste, E. (1916). Gammes sténographiques. Technical report, Institut Sténographique de France, Paris.Google Scholar
Ježek, E. (2016). The Lexicon: An Introduction. Oxford, UK: Oxford University Press.Google Scholar
Ji, Y. and Eisenstein, J. (2014). Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1324.CrossRefGoogle Scholar
Ji, Z., Wei, Q. and Xu, H. (2020). BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020, 269277. AMIA stands for American Medical Informatics Association.Google Scholar
Jiang, J. and Zhai, C. (2007). An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval 10(4–5), 341363.CrossRefGoogle Scholar
Jiang, Z., Li, L., Huang, D. and Jin, L. (2015). Training word embeddings for deep learning in biomedical text mining tasks. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 625628.CrossRefGoogle Scholar
Jimenez, M., Maxime, C., Le Traon, Y. and Papadakis, M. (2018). On the impact of tokenizer and parameters on n-gram based code analysis. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, pp. 437448.CrossRefGoogle Scholar
Jivani, A.G. (2011). A comparative study of stemming algorithms. International Journal of Computer Technology and Applications (IJCTA) 2(6), 19301938.Google Scholar
Jones, B.E. (1994). Exploring the role of punctuation in parsing natural text. In Proceedings of the Fifteenth Conference on Computational Linguistics, vol. 1. Association for Computational Linguistics, pp. 421425.CrossRefGoogle Scholar
Jones, R. (2006). Internet Slang Dictionary. Durham NC, USA: Lulu.com.Google Scholar
Joty, S., Carenini, G., Ng, R. and Murray, G. (2019). Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 1217.CrossRefGoogle Scholar
Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.Google Scholar
Jurafsky, D. and Martin, J.H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd Edn., Fort Collins, CO, USA: Prentice Hall Series in Artificial Intelligence.Google Scholar
Jusoh, S. (2018). A study on NLP applications and ambiguity problems. Journal of Theoretical & Applied Information Technology 96(6), 1486–1499.Google Scholar
Kadhim, A.I. (2018). An evaluation of preprocessing techniques for text classification. International Journal of Computer Science and Information Security (IJCSIS) 16(6), 2232.Google Scholar
Kaewphan, S., Mehryary, F., Hakala, K., Salakoski, T. and Ginter, F. (2017). TurkuNLP entry for interactive Bio-ID assignment. In Proceedings of the BioCreative VI Workshop, pp. 3235.Google Scholar
Kaiser, E. and Trueswell, J.C. (2004). The role of discourse context in the processing of a flexible word-order language. Cognition 94(2), 113147.CrossRefGoogle ScholarPubMed
Kalra, V. and Aggarwal, R. (2018). Importance of text data preprocessing & implementation in RapidMiner. In Proceedings of the First International Conference on Information Technology and Knowledge Management (ICITKM), vol. 14, pp. 7175.Google Scholar
Kamps, J., Adafre, S.F. and De Rijke, M. (2004). Effective translation, tokenization and combination for cross-lingual retrieval. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, pp. 123134.Google Scholar
Kannan, S. and Gurusamy, V. (2014). Preprocessing techniques for text mining. International Journal of Computer Science & Communication Networks 5(1), 716.Google Scholar
Kao, A. and Poteet, S.R. (2007). Natural Language Processing and Text Mining. London, UK: Springer Science & Business Media.CrossRefGoogle Scholar
Karthikeyan, S., Jotheeswaran, J., Balamurugan, B. and Chatterjee, J.M. (2020). Text mining. In Natural Language Processing in Artificial Intelligence, pp. 167210.CrossRefGoogle Scholar
Kathuria, A., Gupta, A. and Singla, R. (2021). A review of tools and techniques for preprocessing of textual data. In Computational Methods and Data Engineering, pp. 407422.CrossRefGoogle Scholar
Kaufmann, M. and Kalita, J. (2010). Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India. Google Scholar
Kaur, J. and Buttar, P.K. (2018). A systematic review on stopword removal algorithms. International Journal on Future Revolution in Computer Science & Communication Engineering 4(4), 207–210.Google Scholar
Kaviani, M. and Rahmani, H. (2020). EmHash: Hashtag recommendation using neural network based on BERT embedding. In 2020 6th International Conference on Web Research (ICWR). IEEE, pp. 113118.CrossRefGoogle Scholar
Khyani, D., Siddhartha, B., Niveditha, N. and Divya, B. (2020). An interpretation of lemmatization and stemming in natural language processing. Journal of University of Shanghai for Science and Technology 22(10), 350357.Google Scholar
Kim, K.H. and Zhu, Y. (2019). Researching Translation in the Age of Technology and Global Conflict: Selected Works of Mona Baker. Abingdon, UK: Routledge.CrossRefGoogle Scholar
Kiss, T. and Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4), 485525.CrossRefGoogle Scholar
Kitaev, N., Cao, S. and Klein, D. (2019). Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760.Google Scholar
Koehn, P., Arun, A. and Hoang, H. (2008). Towards better machine translation quality for the German-English language pairs. In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 139142.CrossRefGoogle Scholar
Korde, V. and Mahender, C.N. (2012). Text classification and classifiers: A survey. International Journal of Artificial Intelligence & Applications 3(2), 85.CrossRefGoogle Scholar
Korenius, T., Laurikkala, J., Järvelin, K. and Juhola, M. (2004). Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625633.CrossRefGoogle Scholar
Koto, F. and Adriani, M. (2015). A comparative study on Twitter sentiment analysis: Which features are good? In International Conference on Applications of Natural Language to Information Systems. Springer, pp. 453457.Google Scholar
Koulali, R. and Meziane, A. (2011). Topic detection and multi-word terms extraction for Arabic unvowelized documents. In Asia Information Retrieval Symposium. Springer, pp. 614623.CrossRefGoogle Scholar
Kozakou, E. (2017). Word Adaptions in the Language of Twitter. Master’s Thesis, Leiden University, Leiden, Netherlands.Google Scholar
Kudo, T. and Richardson, J. (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.Google Scholar
Kulkarni, A. and Shivananda, A. (2019). Natural Language Processing Recipes. Berkeley, CA, USA: Springer.CrossRefGoogle Scholar
Kunilovskaya, M. and Plum, A. (2021). Text preprocessing and its implications in a digital humanities project. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pp. 8593. RANLP stands for International Conference on Recent Advances in Natural Language Processing.CrossRefGoogle Scholar
Kutuzov, A., Fares, M., Oepen, S. and Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 58th Conference on Simulation and Modelling, Linköping, Sweden. Linköping University Electronic Press, pp. 271276.Google Scholar
Kwak, H., Lee, C., Park, H. and Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the Nineteenth International Conference on World Wide Web. ACM, pp. 591600.Google Scholar
Kwartler, T. (2017). Text Mining in Practice with $\mathsf{R}$ . Hoboken, NJ, USA: John Wiley & Sons.CrossRefGoogle Scholar
Labusch, M., Kotz, S.A. and Perea, M. (2022). The impact of capitalized German words on lexical access. Psychological Research 2022(86), 891902.CrossRefGoogle Scholar
Lafferty, J.D. and Blei, D.M. (2006). Correlated topic models. In Advances in Neural Information Processing Systems, pp. 147154.Google Scholar
Lahiri, S. and Mihalcea, R. (2013). Using n-gram and word network features for native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 251259.Google Scholar
Lambert, N.J. (2017). Text mining tutorial. In Group Processes. Edinburgh, Scotland, UK: Springer, Cham, pp. 93117.CrossRefGoogle Scholar
Lambert, P. and Banchs, R.E. (2006). Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-Word-Expressions in a Multilingual Context.Google Scholar
Lan, Y. (1996). Chomskian deep structure and translation. Perspectives: Studies in Translatology 4(1), 103113.CrossRefGoogle Scholar
Lazarinis, F. (2007). Evaluating the searching capabilities of e-commerce web sites in a non-English language. Online Information Review 31(6), 881–891.CrossRefGoogle Scholar
Le, H.P. and Ho, T.V. (2008). A maximum entropy approach to sentence boundary detection of Vietnamese texts. In IEEE International Conference on Research, Innovation and Vision for the Future-RIVF 2008.Google Scholar
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H. and Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 12341240.CrossRefGoogle ScholarPubMed
Leek, J.T. and Jager, L.R. (2017). Is most published research really false? Annual Review of Statistics and Its Application 4, 109122.CrossRefGoogle Scholar
Lende, S.P. and Raghuwanshi, M. (2016). Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave). IEEE, pp. 16.CrossRefGoogle Scholar
Lextek International n.d. Onix text retrieval toolkit – API reference. Available from: https://www.lextek.com/manuals/onix/ [last accessed August 2019].Google Scholar
Li, C., Duan, Y., Wang, H., Zhang, Z., Sun, A. and Ma, Z. (2017). Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Transactions on Information Systems (TOIS) 36(2), 130.CrossRefGoogle Scholar
Li, C., Wang, H., Zhang, Z., Sun, A. and Ma, Z. (2016). Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the Thirty-Ninth International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 165174.CrossRefGoogle Scholar
Li, L., Shao, Y., Song, D., Qiu, X. and Huang, X. (2020). Generating adversarial examples in Chinese texts using sentence-pieces. arXiv preprint arXiv:2012.14769.Google Scholar
Li, X. and Croft, W.B. (2001). Incorporating syntactic information in question answering. Technical report, Center for Intelligent Information Retrieval, University of Massachusetts at Amherst, Amherst MA, USA.CrossRefGoogle Scholar
Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150157.CrossRefGoogle Scholar
Lin, J., Nogueira, R. and Yates, A. (2020). Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467.Google Scholar
Lita, L.V., Ittycheriah, A., Roukos, S. and Kambhatla, N. (2003). tRuEcasIng. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 152159.CrossRefGoogle Scholar
Litvak, M. and Vanetik, N. (2019). Multilingual Text Analysis: Challenges, Models, and Approaches. Singapore: World Scientific.CrossRefGoogle Scholar
Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge, UK: Cambridge University Press.CrossRefGoogle Scholar
Liu, H., Christiansen, T., Baumgartner, W.A. and Verspoor, K. (2012). BioLemmatizer: A lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics 3(1), 129.CrossRefGoogle ScholarPubMed
Liu, L. and Özsu, M.T. (2009). Encyclopedia of Database Systems, vol. 6. New York, NY, USA: Springer.CrossRefGoogle Scholar
Liu, Z. and Jansen, B.J. (2013). Factors influencing the response rate in social question and answering behavior. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 12631274.CrossRefGoogle Scholar
Lo, R.T.-W., He, B. and Ounis, I. (2005). Automatically building a stopword list for an information retrieval system. In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), vol. 5, pp. 1724.Google Scholar
Loginova, E., Varanasi, S. and Neumann, G. (2018). Towards multilingual neural question answering. In European Conference on Advances in Databases and Information Systems. Springer, pp. 274285.Google Scholar
Los, B. and Lubbers, T. (2019). Syntax, text type, genre and authorial voice in old English: A data-driven approach. In Grammar–Discourse–Context: Grammar and Usage in Language Variation and Change, vol. 23, p. 49.CrossRefGoogle Scholar
Lourentzou, I., Manghnani, K. and Zhai, C. (2019). Adapting sequence to sequence models for text normalization in social media. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, pp. 335345. AAAI stands for Association for the Advancement of Artificial Intelligence.CrossRefGoogle Scholar
Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11(1–2), 2231.Google Scholar
Lüdeling, A. and Kytö, M. (2008). Corpus Linguistics, vol. 1. Berlin, Germany: Walter de Gruyter.CrossRefGoogle Scholar
Luhn, H.P. (1959). Keyword-in-context index for technical literature (KWIC index). American Documentation 11(4), 288295.CrossRefGoogle Scholar
Luiz, O.J., Olden, J.D., Kennard, M.J., Crook, D.A., Douglas, M.M., Saunders, T.M. and King, A.J. (2019). Trait-based ecology of fishes: A quantitative assessment of literature trends and knowledge gaps using topic modelling. Fish and Fisheries 20(6), 11001110.CrossRefGoogle Scholar
Luo, J., Tinsley, J. and Lepage, Y. (2013). Exploiting parallel corpus for handling out-of-vocabulary words. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27), pp. 399408.Google Scholar
Lusetti, M., Ruzsics, T., Göhring, A., Samardžić, T. and Stark, E. (2018). Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, pp. 1828.Google Scholar
Makrehchi, M. and Kamel, M.S. (2008). Automatic extraction of domain-specific stopwords from labeled documents. In European Conference on Information Retrieval. Springer, pp. 222233.CrossRefGoogle Scholar
Makrehchi, M. and Kamel, M.S. (2017). Extracting domain-specific stopwords for text classifiers. Intelligent Data Analysis 21(1), 3962.CrossRefGoogle Scholar
Malvern, D. and Richards, B. (2012). Measures of lexical richness. In The Encyclopedia of Applied Linguistics.CrossRefGoogle Scholar
Manning, C., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press.CrossRefGoogle Scholar
Mansurov, B. and Mansurov, A. (2021). Uzbek Cyrillic-Latin-Cyrillic machine transliteration. arXiv preprint arXiv:2101.05162.Google Scholar
Marcu, D. (2000). The Theory and Practice of Discourse Parsing and Summarization. Cambridge, MA, USA: MIT Press.CrossRefGoogle Scholar
Masini, F. (2005). Multi-word expressions between syntax and the lexicon: The case of Italian verb-particle constructions. SKY Journal of Linguistics 18(2005), 145173. SKY stands for Suomen kielitieteellinen yhdistys, from the Linguistic Association of Finland. Google Scholar
Masini, F. (2019). Multi-word expressions and morphology. In Oxford Research Encyclopedia of Linguistics.CrossRefGoogle Scholar
Matusov, E., Leusch, G., Bender, O. and Ney, H. (2005). Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation (IWSLT).Google Scholar
McAuliffe, J.D. and Blei, D.M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121128.Google Scholar
McNamee, P. and Mayfield, J. (2007). N-gram morphemes for retrieval. In CLEF (Working Notes). CLEF stands for the Cross-Language Evaluation Forum workshop.Google Scholar
Mendoza, M., Poblete, B. and Castillo, C. (2010). Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics. ACM, pp. 7179.Google Scholar
Merlo, P. (2019). Probing word and sentence embeddings for long-distance dependencies effects in French and English. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 158172.CrossRefGoogle Scholar
Metzler, H., Baginski, H., Niederkrotenthaler, T. and Garcia, D. (2021). Detecting potentially harmful and protective suicide-related content on Twitter: A machine learning approach. arXiv preprint arXiv:2112.04796.Google Scholar
Meyer, D., Hornik, K. and Feinerer, I. (2008). Text mining infrastructure in $\mathsf{R}$ . Journal of Statistical Software 25(5), 154.Google Scholar
Miah, M. (2009). Improved k-NN algorithm for text classification. In Proceedings of the 2009 International Conference on Data Mining (DMIN). Citeseer, pp. 434440.Google Scholar
Mieke, S.S. (2016). Language diversity in ACL 2004–2016. ACL stands for the annual meeting of the Association for Computational Linguistics. Available from: https://sjmielke.com/acl-language-diversity.htm.Google Scholar
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.Google Scholar
Miller, J.-A. (2014). Language: Much ado about what? In Lacan and the Subject of Language, pp. 2135.Google Scholar
Mitrofan, M. and Tufiş, D. (2018). Bioro: The biomedical corpus for the Romanian language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).Google Scholar
Moh, T.-S. and Zhang, Z. (2012). Cross-lingual text classification with model translation and document translation. In Proceedings of the 50th Annual Southeast Regional Conference, pp. 7176.CrossRefGoogle Scholar
Moon, S. and Okazaki, N. (2020). Jamo pair encoding: Subcharacter representation-based extreme Korean vocabulary compression for efficient subword tokenization. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 34903497.Google Scholar
Moon, S. and Okazaki, N. (2021). Effects and mitigation of out-of-vocabulary in universal language models. Journal of Information Processing 29, 490503.CrossRefGoogle Scholar
Morante, R. and Blanco, E. (2021). Recent advances in processing negation. Natural Language Engineering 27(2), 110.Google Scholar
Morgan, A.A., Lu, Z., Wang, X., Cohen, A.M., Fluck, J., Ruch, P., Divoli, A., Fundel, K., Leaman, R. and Hakenberg, J. (2008). Overview of BioCreative II gene normalization. Genome Biology 9(2), 119.CrossRefGoogle ScholarPubMed
Mostafa, M.M. (2013). More than words: Social networks’ text mining for consumer brand sentiments. Expert Systems with Applications 40(10), 42414251.CrossRefGoogle Scholar
Mubarak, H. (2017). Build fast and accurate lemmatization for Arabic. arXiv preprint arXiv:1710.06700.Google Scholar
Mullen, L.A., Benoit, K., Keyes, O., Selivanov, D. and Arnold, J. (2018). Fast, consistent tokenization of natural language text. Journal of Open Source Software 3(23), 655.CrossRefGoogle Scholar
Munro, R. (2015). Language at ACL this year. ACL stands for the annual conference of the Association for Computational Linguistics. Available from: http://www.junglelightspeed.com/languages-at-acl-this-year/.Google Scholar
Namly, D., Bouzoubaa, K., El Jihad, A. and Aouragh, S.L. (2020). Improving Arabic lemmatization through a lemmas database and a machine-learning technique. In Recent Advances in NLP: The Case of Arabic Language. Springer, pp. 81100.CrossRefGoogle Scholar
Narala, S., Rani, B.P. and Ramakrishna, K. (2017). Telugu text categorization using language models. Global Journal of Computer Science and Technology 16(4), 9–13.Google Scholar
Nasir, A., Aslam, K., Tariq, S. and Ullah, M.F. (2020). Predicting mental illness using social media posts and comments. International Journal of Advanced Computer Science and Applications 11(12), 607613.Google Scholar
Nayak, A.S., Kanive, A., Chandavekar, N. and Balasubramani, R. (2016). Survey on pre-processing techniques for text mining. International Journal of Engineering and Computer Science 5(6) 23197242.Google Scholar
Nayel, H.A., Shashirekha, H., Shindo, H. and Matsumoto, Y. (2019). Improving multi-word entity recognition for biomedical texts. arXiv preprint arXiv:1908.05691.Google Scholar
Névéol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C. and Zweigenbaum, P. (2017). CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French. In CLEF (Working Notes). CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar
Newman, M.E. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46(5), 323351.CrossRefGoogle Scholar
Ng, D., Bansal, M. and Curran, J.R. (2015). Web-scale surface and syntactic n-gram features for dependency parsing. arXiv preprint arXiv:1502.07038.Google Scholar
Nguyen, D.Q., Billingsley, R., Du, L. and Johnson, M. (2015). Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3, 299313.CrossRefGoogle Scholar
Nicolai, G. and Kondrak, G. (2016). Leveraging inflection tables for stemming and lemmatization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11381147.CrossRefGoogle Scholar
Nivre, J. (2005). Dependency grammar and dependency parsing. MSI Report 5133(1959), 132. MSI report is from Växjö University, Växjö, Sweden.Google Scholar
Nothman, J., Qin, H. and Yurchak, R. (2018). Stop word lists in free open-source software packages. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 712.CrossRefGoogle Scholar
Novák, A. and Siklósi, B. (2015). Automatic diacritics restoration for Hungarian. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 22862291.CrossRefGoogle Scholar
Nugent, R. (2020). Instead of just teaching data science, let’s understand how and why people do it. Symposium on Data Science and Statistics. Abstract available from: https://ww2.amstat.org/meetings/sdss/2020/onlineprogram/AbstractDetails.cfm?AbstractID=308230.Google Scholar
Nunberg, G. (1990). The Linguistics of Punctuation , vol. 18. Center for the Study of Language (CSLI), Stanford, CA, USA: Stanford University.Google Scholar
Nuzumlal, M.Y. and Özgür, A. (2014). Analyzing stemming approaches for Turkish multi-document summarization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 702–706.CrossRefGoogle Scholar
Olde, B.A., Hoeffner, J., Chipman, P., Graesser, A.C. and Research Group, Tutoring (1999). A connectionist model for part of speech tagging. In Florida Artificial Intelligence Research Society (FLAIRS) Conference, pp. 172176.Google Scholar
Paice, C. and Hooper, R. (2005). Lancaster stemmer. Available from: http://www.comp.lancs.ac.uk/computing/research/stemming/[last accessed August 2019].Google Scholar
Pais, V., Ion, R., Avram, A.-M., Mitrofan, M. and Tufis, D. (2021). In-depth evaluation of Romanian natural language processing pipelines. Romanian Journal of Information Science and Technology 24(4), 384401.Google Scholar
Pak, A. and Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pp. 13201326.Google Scholar
Pantel, P. and Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the Twenty-First International Conference on Computational Linguistics and the Forty-Fourth Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 113120.CrossRefGoogle Scholar
Papakyriakopoulos, O., Hegelich, S., Serrano, J.C.M. and Marco, F. (2020). Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 446457.CrossRefGoogle Scholar
Pareti, S., O’Keefe, T., Konstas, I., Curran, J.R. and Koprinska, I. (2013). Automatically detecting and attributing indirect quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 989999.Google Scholar
Patil, A., Pharande, K., Nale, D. and Agrawal, R. (2015). Automatic text summarization. International Journal of Computer Applications 109(17), 18–19.CrossRefGoogle Scholar
Patro, G.K., Chakraborty, A., Ganguly, N. and Gummadi, K.P. (2020). On fair virtual conference scheduling: Achieving equitable participant and speaker satisfaction. arXiv preprint arXiv:2010.14624.Google Scholar
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 28252830.Google Scholar
Peitz, S., Freitag, M., Mauser, A. and Ney, H. (2011). Modeling punctuation prediction as machine translation. In International Workshop on Spoken Language Translation (IWSLT).Google Scholar
Peldszus, A. and Stede, M. (2015). Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 938948. MST stands for minimum spanning tree.CrossRefGoogle Scholar
Pennell, D. and Liu, Y. (2011). Toward text message normalization: Modeling abbreviation generation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 53645367.CrossRefGoogle Scholar
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 15321543.CrossRefGoogle Scholar
Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Birmingham, UK: Packt Publishing Ltd.Google Scholar
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.Google Scholar
Petic, M. and Gîfu, D. (2014). Transliteration and alignment of parallel texts from Cyrillic to Latin. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 18191823.Google Scholar
Piktus, A., Edizel, N.B., Bojanowski, P., Grave, É., Ferreira, R. and Silvestri, F. (2019). Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 32263234.CrossRefGoogle Scholar
Poddar, L. (2016). Multilingual multiword expressions. arXiv preprint arXiv:1612.00246.Google Scholar
Polignano, M., Basile, P., De Gemmis, M., Semeraro, G. and Basile, V. (2019). Alberto: Italian BERT language understanding model for NLP challenging tasks based on Tweets. In 6th Italian Conference on Computational Linguistics, CLiC-it 2019, vol. 2481. CEUR Workshop Proceedings, pp. 16.Google Scholar
Poria, S., Agarwal, B., Gelbukh, A., Hussain, A. and Howard, N. (2014). Dependency-based semantic parsing for concept-level text analysis. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 113127.Google Scholar
Porter, M.F. (1980). An algorithm for suffix stripping. Program 14(3), 130137.CrossRefGoogle Scholar
Posada, J.D., Barda, A.J., Shi, L., Xue, D., Ruiz, V., Kuan, P.-H., Ryan, N.D. and Tsui, F.R. (2017). Predictive modeling for classification of positive valence system symptom severity from initial psychiatric evaluation records. Journal of Biomedical Informatics 75, S94S104.CrossRefGoogle Scholar
Potts, C. (n.d). Sentiment symposium tutorial: Stemming. Available from: http://sentiment.christopherpotts.net/stemming.html [last accessed August 2019].Google Scholar
Preethi, V. (2021). Survey on text transformation using Bi-LSTM in natural language processing with text data. Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12(9), 25772585.Google Scholar
Przepiórkowski, A., Piasecki, M., Jassem, K. and Fuglewicz, P. (2012). Computational Linguistics: Applications, vol. 458. Heidelberg, Germany: Springer.Google Scholar
Ptaszynski, M., Lempa, P., Masui, F., Kimura, Y., Rzepka, R., Araki, K., Wroczynski, M. and Leliwa, G. (2019). Brute-force sentence pattern extortion from harmful messages for cyberbullying detection. Journal of the Association for Information Systems 20(8), 4.Google Scholar
Putra, S.J., Gunawan, M.N. and Suryatno, A. (2018). Tokenization and n-gram for indexing Indonesian translation of the Quran. In 2018 6th International Conference on Information and Communication Technology (ICoICT). IEEE, pp. 158161.CrossRefGoogle Scholar
Qian, Z., Li, P., Zhu, Q., Zhou, G., Luo, Z. and Luo, W. (2016). Speculation and negation scope detection via convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 815825.CrossRefGoogle Scholar
Qudar, M. and Mago, V. (2020). A survey on language models. Technical report, Lakehead University, Thunder Bay, Ontario, Canada.Google Scholar
Raff, E., Fleming, W., Zak, R., Anderson, H., Finlayson, B., Nicholas, C. and McLean, M. (2019). Kilograms: Very large n-grams for malware classification. arXiv preprint arXiv:1908.00200.Google Scholar
Raff, E. and Nicholas, C. (2018). Hash-grams: Faster n-gram features for classification and malware detection. In Proceedings of the ACM Symposium on Document Engineering 2018, pp. 14.CrossRefGoogle Scholar
Rahm, E. and Do, H.H. (2000). Data cleaning: Problems and current approaches. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 23(4), 313.Google Scholar
Rahman, R. (2017). Detecting emotion from text and emoticon. London Journal of Research in Computer Science and Technology 17(3), 913.Google Scholar
Rajaraman, A. and Ullman, J.D. (2011). Mining of Massive Datasets. Cambridge, UK: Cambridge University Press.CrossRefGoogle Scholar
Rajput, B.S. and Khare, N. (2015). A survey of stemming algorithms for information retrieval. International Organization of Scientific Research – Journal of Computer Engineering 17(3), 7680.Google Scholar
Raulji, J.K. and Saini, J.R. (2016). Stop-word removal algorithm and its implementation for Sanskrit language. International Journal of Computer Applications 150(2), 1517.Google Scholar
Reber, U. (2019). Overcoming language barriers: Assessing the potential of machine translation and topic modeling for the comparative analysis of multilingual text corpora. Communication Methods and Measures 13(2), 102125.CrossRefGoogle Scholar
Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 4550. LREC stands for Conference on Language Resources and Evaluation.Google Scholar
Reynar, J.C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics, pp. 1619.CrossRefGoogle Scholar
Richardson, L. (2020). Beautiful Soup 4.9.1. Python library. Available from: https://www.crummy.com/software/BeautifulSoup/.Google Scholar
Rikters, M. and Bojar, O. (2017). Paying attention to multi-word expressions in neural machine translation. arXiv preprint arXiv:1710.06313.Google Scholar
Rinker, T.W. (2018a). lexicon: Lexicon Data. $\mathsf{R}$ package version 1.2.1. Available from: http://github.com/trinker/lexicon.Google Scholar
Rinker, T.W. (2018b). textclean: Text Cleaning Tools. $\mathsf{R}$ package version 0.9.3. Available from: https://github.com/trinker/textclean.Google Scholar
Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B. and Rand, D.G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science 58(4), 10641082.CrossRefGoogle Scholar
Robinson, D. (2018). gutenbergr: Download and Process Public Domain Works from Project Gutenberg. $\mathsf{R}$ package version 0.1.4. Available from: https://CRAN.R-project.org/package=gutenbergr.Google Scholar
Rosenthal, S. and McKeown, K. (2013). Columbia NLP: Sentiment detection of subjective phrases in social media. In Second Joint Conference on Lexical and Computational Semantics (SEM): Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 478482.Google Scholar
Rosid, M.A., Fitrani, A.S., Astutik, I.R.I., Mulloh, N.I. and Gozali, H.A. (2020). Improving text preprocessing for student complaint document classification using Sastrawi. In IOP Conference Series: Materials Science and Engineering, vol. 874, 012017, Bristol, UK: IOP Publishing.CrossRefGoogle Scholar
Saal, F.E., Downey, R.G. and Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin 88(2), 413.CrossRefGoogle Scholar
Sadvilkar, N. and Neumann, M. (2020). PySBD: Pragmatic sentence boundary disambiguation. arXiv preprint arXiv:2010.09657.Google Scholar
Sag, I.A., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 115.Google Scholar
Sailer, M. and Markantonatou, S. (2018). Multiword Expressions: Insights from a Multi-Lingual Perspective. Berlin, Germany: Language Science Press.Google Scholar
Samad, M.D., Khounviengxay, N.D. and Witherow, M.A. (2020). Effect of text processing steps on Twitter sentiment classification using word embedding. arXiv preprint arXiv:2007.13027.Google Scholar
Samir, A. and Lahbib, Z. (2018). Stemming and lemmatization for information retrieval systems in Amazigh language. In International Conference on Big Data, Cloud and Applications. Springer, pp. 222233.CrossRefGoogle Scholar
Sánchez-Vega, F., Villatoro-Tello, E., Montes-y Gómez, M., Rosso, P., Stamatatos, E. and Villaseñor-Pineda, L. (2019). Paraphrase plagiarism identification with character-level features. Pattern Analysis and Applications 22(2), 669681.CrossRefGoogle Scholar
Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6(12), e26752. Available from: https://catalog.ldc.upenn.edu/LDC2008T19.Google Scholar
Sarica, S. and Luo, J. (2020). Stopwords in technical language processing. arXiv preprint arXiv:2006.02633.Google Scholar
Sarkar, D. (2019). Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing. Bangalore, India: Apress.CrossRefGoogle Scholar
Schofield, A., Magnusson, M. and Mimno, D. (2017). Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the Fifteenth Conference of the European Chapter of the Association for Computational Linguistics, vol. 2, pp. 432436.CrossRefGoogle Scholar
Schuman, H. and Presser, S. (1996). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Thousand Oaks, CA, USA: Sage.Google Scholar
Schuur, Y. (2020). Normalization for Dutch for Improved POS Tagging. Master’s Thesis, University of Groningen, Groningen, Netherlands.Google Scholar
Şeker, G.A. and Eryiğit, G. (2012). Initial explorations on using CRFs for Turkish named entity recognition. In Proceedings of COLING 2012, pp. 24592474. CRFs stand for conditional random fields, and COLING stands for International Conference on Computational Linguistics.Google Scholar
Seretan, V. (2011). Syntax-Based Collocation Extraction, vol. 44. Heidelberg, Germany: Springer Science & Business Media.CrossRefGoogle Scholar
Shaikh, A., More, D., Puttoo, R., Shrivastav, S. and Shinde, S. (2019). A survey paper on chatbots. International Research Journal of Engineering and Technology (IRJET) 6(4), 1786–1789.Google Scholar
Shalaby, W., Zadrozny, W. and Jin, H. (2019). Beyond word embeddings: Learning entity and concept representations from large scale knowledge bases. Information Retrieval Journal 22(6), 525542.CrossRefGoogle Scholar
Silge, J. and Robinson, D. (2017). Text Mining with $\mathsf{R}$ : A Tidy Approach. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar
Sinclair, J. (2018). Collins English Dictionary, 13th Edn. New York NY, USA: HarperCollins.Google Scholar
Singh, V. and Saini, B. (2014). An effective tokenization algorithm for information retrieval systems. In First International Conference on Data Mining, pp. 109119.CrossRefGoogle Scholar
Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24(4), 3543.Google Scholar
Slioussar, N. (2011). Processing of a free word order language: The role of syntax and context. Journal of Psycholinguistic Research 40(4), 291306.CrossRefGoogle ScholarPubMed
Søgaard, A., de Lhoneux, M. and Augenstein, I. (2018). Nightmare at test time: How punctuation prevents parsers from generalizing. arXiv preprint arXiv:1809.00070.Google Scholar
Soriano, J., Au, T. and Banks, D. (2013). Text mining in computational advertising. Statistical Analysis and Data Mining: The ASA Data Science Journal 6(4), 273285. ASA stands for American Statistical Association.CrossRefGoogle Scholar
Soricut, R. and Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 228235.CrossRefGoogle Scholar
Stamatatos, E. (2008). Author identification: Using text sampling to handle the class imbalance problem. Information Processing & Management 44(2), 790799.CrossRefGoogle Scholar
Stamatatos, E. (2011). Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology 62(12), 25122527.CrossRefGoogle Scholar
Sweeney, K. (2020). Unsupervised Machine Learning for Conference Scheduling: A Natural Language Processing Approach based on Latent Dirichlet Allocation. Master’s Thesis, NHH Norwegian School of Economics, Bergen, Norway.Google Scholar
Tabii, Y., Lazaar, M., Al Achhab, M. and Enneya, N. (2018). Big Data, Cloud and Applications (BDCA): Third International Conference, BDCA 2018, Kenitra, Morocco, 4–5 April 2018, Revised Selected Papers, vol, 872. Springer.CrossRefGoogle Scholar
Taboada, M., Meizoso, M., Martínez, D., Riano, D. and Alonso, A. (2013). Combining open-source natural language processing tools to parse clinical practice guidelines. Expert Systems 30(1), 311.CrossRefGoogle Scholar
Tan, F., Hu, Y., Hu, C., Li, K. and Yen, K. (2020). TNT: Text normalization based pre-training of transformers for content moderation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 47354741.CrossRefGoogle Scholar
Tan, L. and Pal, S. (2014). Manawi: Using multi-word expressions and named entities to improve machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 201206.CrossRefGoogle Scholar
Tang, C., Ross, K., Saxena, N. and Chen, R. (2011). What’s in a name: A study of names, gender inference, and gender behavior in Facebook. In International Conference on Database Systems for Advanced Applications. Springer, pp. 344356.CrossRefGoogle Scholar
Tang, J., Hong, M., Zhang, D.L. and Li, J. (2008). Information extraction: Methodologies and applications. In Emerging Technologies of Text Mining: Techniques and Applications. USA: IGI Global, pp. 133.CrossRefGoogle Scholar
Temnikova, I., Nikolova, I., Baumgartner, W.A. Jr, Angelova, G. and Cohen, K.B. (2013). Closure properties of Bulgarian clinical text. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pp. 667675.Google Scholar
Thanaki, J. (2017). Python Natural Language Processing. Birmingham, UK: Packt Publishing Ltd.Google Scholar
Thione, G. and van den Berg, M. (2007). Systems and methods for structural indexing of natural language text. Google Patents. US Patent Application No. 11/405,385.Google Scholar
Toral, A., Pecina, P., Wang, L. and Van Genabith, J. (2015). Linguistically-augmented perplexity-based data selection for language models. Computer Speech & Language 32(1), 1126.CrossRefGoogle Scholar
Torres-Moreno, J.-M. (2012). Beyond stemming and lemmatization: Ultra-stemming to improve automatic text summarization. arXiv preprint arXiv:1209.3126.Google Scholar
Torres-Moreno, J.-M. (2014). Automatic Text Summarization. Hoboken, NJ, USA: John Wiley & Sons.CrossRefGoogle Scholar
Tran, D. and Sharma, D. (2005). Markov models for written language identification. In Proceedings of the 12th International Conference on Neural Information Processing, pp. 6770.Google Scholar
Trieschnigg, D., Kraaij, W. and de Jong, F. (2007). The influence of basic tokenization on biomedical document retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 803804.CrossRefGoogle Scholar
Truss, L. (2004). Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation. New York, NY, USA: Penguin.Google Scholar
Truss, L. (2006). Eats, Shoots & Leaves: Why, Commas Really do make a Difference! New York, NY, USA: Penguin.Google Scholar
Tsarfaty, R., Seddah, D., Goldberg, Y., Kübler, S., Candito, M., Foster, J., Versley, Y., Rehbein, I. and Tounsi, L. (2010). Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL-HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages. Association for Computational Linguistics, pp. 112. NAACL-HLT stands for the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Google Scholar
Venhuizen, N.J., Bos, J., Hendriks, P. and Brouwer, H. (2018). Discourse semantics with information structure. Journal of Semantics 35(1), 127169.CrossRefGoogle Scholar
Verma, R.M. and Marchette, D.J. (2019). Cybersecurity Analytics. New York, NY, USA: CRC Press.CrossRefGoogle Scholar
Verspoor, C.M., Joslyn, C. and Papcun, G.J. (2003). The gene ontology as a source of lexical semantic knowledge for a biological natural language processing application. In SIGIR Workshop on Text Analysis and Search for Bioinformatics, pp. 5156.Google Scholar
Verspoor, K., Dvorkin, D., Cohen, K.B. and Hunter, L. (2009). Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12), i77i84.CrossRefGoogle ScholarPubMed
Vieweg, S., Hughes, A.L., Starbird, K. and Palen, L. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, pp. 10791088. SIGCHI stands for Special Interest Group on Computer-Human Interaction.CrossRefGoogle Scholar
Vijayarani, S., Ilamathi, J. and Nithya, V. (2015). Preprocessing techniques for text mining: An overview. International Journal of Computer Science & Communication Networks 5(1), 716.Google Scholar
Vilares, J., Vilares, M., Alonso, M.A. and Oakes, M.P. (2016). On the feasibility of character n-grams pseudo-translation for cross-language information retrieval tasks. Computer Speech & Language 36, 136164.CrossRefGoogle Scholar
Villares-Maldonado R. (2017). A corpus-based analysis of non-standard English features in the microblogging platform Tumblr. EPiC Series in Language and Linguistics 2, 295303.CrossRefGoogle Scholar
Wallach, H.M. (2006). Topic modeling: Beyond bag-of-words. In Proceedings of the Twenty-Third International Conference on Machine Learning. ACM, pp. 977984.CrossRefGoogle Scholar
Wang, H., Liu, L., Song, W. and Lu, J. (2014a). Feature-based sentiment analysis approach for product reviews. Journal of Software 9(2), 274279.Google Scholar
Wang, W., Gu, Z. and Tang, K. (2020). An improved algorithm for Bert. In Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies, pp. 116120.CrossRefGoogle Scholar
Wang, X., McCallum, A. and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, pp. 697702.CrossRefGoogle Scholar
Wang, X., Tokarchuk, L. and Poslad, S. (2014b). Identifying relevant event content for real-time event detection. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 395398.CrossRefGoogle Scholar
Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., Kingsbury, P. and Liu, H. (2018). A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics 87, 1220.CrossRefGoogle ScholarPubMed
Webster, J.J. and Kit, C. (1992). Tokenization as the initial phase in NLP. In Proceedings of the 14th Conference on Computational Linguistics, pp. 11061110.CrossRefGoogle Scholar
Wei, Z., Miao, D., Chauchat, J.-H., Zhao, R. and Li, W. (2009). N-grams based feature selection and text representation for Chinese text classification. International Journal of Computational Intelligence Systems 2(4), 365374.Google Scholar
Welbers, K., Van Atteveldt, W. and Benoit, K. (2017). Text analysis in $\mathsf{R}$ . Communication Methods and Measures 11(4), 245265.CrossRefGoogle Scholar
Wermter, J. and Hahn, U. (2005). Paradigmatic modifiability statistics for the extraction of complex multi-word terms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 843850.CrossRefGoogle Scholar
White, L., Togneri, R., Liu, W. and Bennamoun, M. (2018). Neural Representations of Natural Language, vol. 783. Singapore: Springer.Google Scholar
White, R.W. and Morris, D. (2007). Investigating the querying and browsing behavior of advanced search engine users. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 255262.CrossRefGoogle Scholar
Widyassari, A.P., Rustad, S., Shidik, G.F., Noersasongko, E., Syukur, A. and Affandy, A. (2022). Review of automatic text summarization techniques & methods. Journal of King Saud University-Computer and Information Sciences 34(4), 1029–1046.CrossRefGoogle Scholar
Wijffels, J. (2021). udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. $\mathsf{R}$ package version 0.8.8. Available from: https://cran.r-project.org/web/packages/udpipe/udpipe.pdf.Google Scholar
Wilson, K.S. (2009). Database search control. US Patent Application No. 12/031,701.Google Scholar
Wong, D.F., Chao, L.S. and Zeng, X. (2014). iSentenizer- $\mu$ : Multilingual sentence boundary detection model. The Scientific World Journal 2014, 110.Google Scholar
Woo, H., Kim, J. and Lee, W. (2020). Validation of text data preprocessing using a neural network model. Mathematical Problems in Engineering 2020, 1–9.CrossRefGoogle Scholar
Xue, J., Chen, J., Hu, R., Chen, C., Zheng, C. and Zhu, T. (2020). Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach. arXiv preprint arXiv: 2005.12830.Google Scholar
Yamada, I., Takeda, H. and Takefuji, Y. (2015). Enhancing named entity recognition in Twitter messages using entity linking. In Proceedings of the Workshop on Noisy User-Generated Text, pp. 136140.CrossRefGoogle Scholar
Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science 47(5), 357369.3.0.CO;2-V>CrossRefGoogle Scholar
Yazdani, A., Ghazisaeedi, M., Ahmadinejad, N., Giti, M., Amjadi, H. and Nahvijou, A. (2020). Automated misspelling detection and correction in Persian clinical text. Journal of Digital Imaging 33(3), 555–562.CrossRefGoogle ScholarPubMed
Ye, X., Shen, H., Ma, X., Bunescu, R. and Liu, C. (2016). From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, pp. 404415.CrossRefGoogle Scholar
Yeh, K.C. (2003). Bilingual sentence alignment based on punctuation marks. In Proceedings of the ROCLING 2003 Student Workshop, pp. 303312. ROCLING stands for Conference on Computational Linguistics and Speech Processing.Google Scholar
Yin, D., Ren, F., Jiang, P. and Kuroiwa, S. (2007). Chinese complex long sentences processing method for Chinese-Japanese machine translation. In 2007 International Conference on Natural Language Processing and Knowledge Engineering. IEEE, pp. 170175.CrossRefGoogle Scholar
Yolchuyeva, S., Németh, G. and Gyires-Tóth, B. (2018). Text normalization with convolutional neural networks. International Journal of Speech Technology 21(3), 589600.CrossRefGoogle Scholar
Yuhang, Z., Yue, W. and Wei, Y. (2010). Research on data cleaning in text clustering. In 2010 International Forum on Information Technology and Applications, vol. 1. IEEE, pp. 305307.CrossRefGoogle Scholar
Zaman, A., Matsakis, P. and Brown, C. (2011). Evaluation of stop word lists in text retrieval using latent semantic indexing. In 2011 Sixth International Conference on Digital Information Management. IEEE, pp. 133136.CrossRefGoogle Scholar
Zeng, C., Li, S., Li, Q., Hu, J. and Hu, J. (2020). A survey on machine reading comprehension–tasks, evaluation metrics and benchmark datasets. Applied Sciences 10(21), 7640.
Zeng, Z., Xu, H., Chong, T.Y., Chng, E.-S. and Li, H. (2017). Improving N-gram language modeling for code-switching speech recognition. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, pp. 1596–1601.
Zhang, C., Baldwin, T., Ho, H., Kimelfeld, B. and Li, Y. (2013). Adaptive parser-centric text normalization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1159–1168.
Zhang, H. (2008). Exploring regularity in source code: Software science and Zipf’s law. In 2008 15th Working Conference on Reverse Engineering. IEEE, pp. 101–110.
Zhang, J. and El-Gohary, N.M. (2017). Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking. Automation in Construction 73, 45–57.
Zhang, L. and Komachi, M. (2018). Neural machine translation of logographic languages using sub-character level information. arXiv preprint arXiv:1809.02694.
Zhang, X. and LeCun, Y. (2017). Which encoding is the best for text classification in Chinese, English, Japanese and Korean? arXiv preprint arXiv:1708.02657.
Zhang, Y., Chen, Q., Yang, Z., Lin, H. and Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data 6(1), 1–9.
Zhou, L., Siddiqui, T., Seliger, S.L., Blumenthal, J.B., Kang, Y., Doerfler, R. and Fink, J.C. (2019). Text preprocessing for improving hypoglycemia detection from clinical notes – A case study of patients with diabetes. International Journal of Medical Informatics 129, 374–380.
Zhou, M., Duan, N., Liu, S. and Shum, H.-Y. (2020). Progress in neural NLP: Modeling, learning, and reasoning. Engineering 6(3), 275–290.
Zipf, G.K. (1949). Human behavior and the principle of least effort.
Zubiaga, A., Spina, D., Martínez, R. and Fresno, V. (2015). Real-time classification of Twitter trends. Journal of the Association for Information Science and Technology 66(3), 462–473.
Zupon, A., Crew, E. and Ritchie, S. (2021). Text normalization for low-resource languages of Africa. arXiv preprint arXiv:2103.15845.
Zweigenbaum, P. and Grabar, N. (2002a). Accenting unknown words: Application to the French version of the MeSH. In Workshop NLP in Biomedical Applications, EFMI, Cyprus, 69á. EFMI stands for European Federation for Medical Informatics.
Zweigenbaum, P. and Grabar, N. (2002b). Restoring accents in unknown biomedical words: Application to the French MeSH thesaurus. International Journal of Medical Informatics 67(1–3), 113–126.