Comparison of text preprocessing methods

Christine P. Chai

doi:10.1017/S1351324922000213

Published online by Cambridge University Press: 13 June 2022

Christine P. Chai*
Affiliation: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA
*E-mail: chrchai@microsoft.com

Abstract

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
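
As a rough illustration of how several of the methods listed above can be chained, the following minimal Python sketch removes markup, lowercases and tokenizes the text, drops stopwords, and builds n-grams. It is not the article's implementation: the function name `preprocess`, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for this example only, and a real application would tune each step to the corpus and task.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "for"}


def preprocess(text, n=2):
    """Strip tags, lowercase, tokenize, remove stopwords, and build n-grams."""
    # Removing formatting: replace HTML-like tags with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalization and tokenization: lowercase, then keep alphanumeric
    # runs, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    # Stopword removal against the assumed list above.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # N-gramming: join each window of n consecutive tokens.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams


tokens, bigrams = preprocess("<p>The corpus is ready for modeling.</p>")
print(tokens)
print(bigrams)
```

Each step here is a design choice rather than a default: dropping punctuation, for example, discards information that some applications need, which is why the appropriate pipeline depends on the corpus and the intended analysis.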

Keywords

Data preprocessing; Parsing; Text data mining

Type: Survey Paper
Information: Natural Language Engineering, Volume 29, Issue 3, May 2023, pp. 509–553

DOI: https://doi.org/10.1017/S1351324922000213
Copyright: © The Author(s), 2022. Published by Cambridge University Press

References

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M. and Shah, Z. (2020). Top concerns of Tweeters during the COVID-19 pandemic: Infoveillance study. Journal of Medical Internet Research (JMIR) 22(4), e19016.CrossRef Google Scholar PubMed

Abdelali, A., Darwish, K., Durrani, N. and Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 11–16.CrossRef Google Scholar

Abdolahi, M. and Zahedh, M. (2017). Sentence matrix normalization using most likely n-grams vector. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI). IEEE, pp. 0040–0045.CrossRef Google Scholar

Abraham, A., Dutta, P., Mandal, J.K., Bhattacharya, A. and Dutta, S. (2018). Emerging Technologies in Data Mining and Information Security: Proceedings of IEMIS 2018, Volume 2, vol. 813. Springer. IEMIS Stands for International Conference on Emerging Technologies in Data Mining and Information Security.Google Scholar

Acree, B.D. (2016). Deep Learning and Ideological Rhetoric. PhD Dissertation, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.Google Scholar

Ács, J., Kádár, Á. and Kornai, A. (2021). Subword pooling makes a difference. arXiv preprint arXiv:2102.10864.Google Scholar

Adnan, K. and Akbar, R. (2019a). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data 6(1), 1–38.CrossRef Google Scholar

Adnan, K. and Akbar, R. (2019b). Limitations of information extraction methods and techniques for heterogeneous unstructured big data. International Journal of Engineering Business Management 11, 1–23.CrossRef Google Scholar

Aggarwal, C.C. and Zhai, C. (2012). Mining Text Data. Boston, MA, USA: Springer Science & Business Media.CrossRef Google Scholar

Agić, Ž., Merkler, D. and Berović, D. (2013). Parsing Croatian and Serbian by using Croatian dependency treebanks. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 22–33.Google Scholar

Agre, G., van Genabith, J. and Declerck, T. (2018). Artificial Intelligence: Methodology, Systems, and Applications: 18th International Conference, AIMSA 2018, Varna, Bulgaria, 12–14 September 2018, Proceedings, vol. 11089. Springer.CrossRef Google Scholar

Akhtar, A.K., Sahoo, G. and Kumar, M. (2017). Digital corpus of Santali language. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, pp. 934–938.CrossRef Google Scholar

Akkasi, A., Varoğlu, E. and Dimililer, N. (2016). ChemTok: A new rule based tokenizer for chemical named entity recognition. BioMed Research International 2016, 1–9.CrossRef Google Scholar PubMed

Al-Khafaji, H.K. and Habeeb, A.T. (2017). Efficient algorithms for preprocessing and stemming of tweets in a sentiment analysis system. International Organization of Scientific Research – Journal of Computer Engineering 9(3), 44–50.Google Scholar

Al-Molegi, A., Alsmadi, I., Najadat, H. and Albashiri, H. (2015). Automatic learning of Arabic text categorization. International Journal of Digital Contents and Applications 2(1), 1–16.Google Scholar

Al Sharou, K., Li, Z. and Specia, L. (2021). Towards a better understanding of noise in natural language processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pp. 53–62.CrossRef Google Scholar

Albalawi, R., Yeap, T.H. and Benyoucef, M. (2020). Using topic modeling methods for short-text data: A comparative analysis. Frontiers in Artificial Intelligence 3, 42.CrossRef Google Scholar PubMed

Albishre, K., Albathan, M. and Li, Y. (2015). Effective 20 newsgroups dataset cleaning. In 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3. IEEE, pp. 98–101.CrossRef Google Scholar

Aliwy, A.H. (2012). Tokenization as preprocessing for Arabic tagging system. International Journal of Information and Education Technology 2(4), 348–353.CrossRef Google Scholar

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B. and Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.Google Scholar

Alonso, M.A., Gómez-Rodríguez, C. and Vilares, J. (2021). On the use of parsing for named entity recognition. Applied Sciences 11(3), 1090.CrossRef Google Scholar

Amarasinghe, K., Manic, M. and Hruska, R. (2015). Optimal stop word selection for text mining in critical infrastructure domain. In 2015 Resilience Week (RWS). IEEE, pp. 1–6.CrossRef Google Scholar

Anandarajan, M., Hill, C. and Nolan, T. (2019). Text preprocessing. In Practical Text Analytics. Springer, pp. 45–59.CrossRef Google Scholar

Anderwald, L. (2003). Negation in Non-Standard British English: Gaps, Regularizations and Asymmetries. London, UK: Routledge.CrossRef Google Scholar

Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F. and Manicardi, S. (2016). A comparison between preprocessing techniques for sentiment analysis in Twitter. In Proceedings of the 2nd International Workshop on Knowledge Discovery on the WEB (KDWeb).Google Scholar

Arens, R. (2004). A preliminary look into the use of named entity information for bioscience text tokenization. In Proceedings of the Student Research Workshop at HLT-NAACL 2004, pp. 37–42. HLT-NAACL stands for Human Language Technologies – North American Chapter of the Association for Computational Linguistics.CrossRef Google Scholar

Arief, M. and Deris, M.B.M. (2021). Text preprocessing impact for sentiment classification in product review. In 2021 Sixth International Conference on Informatics and Computing (ICIC). IEEE, pp. 1–7.CrossRef Google Scholar

Armano, G., Fanni, F. and Giuliani, A. (2015). Stopwords identification by means of characteristic and discriminant analysis. In ICAART (2), pp. 353–360. ICCART stands for International Conference on Agents and Artificial Intelligence.CrossRef Google Scholar

Armengol Estapé, J. (2021). A Pipeline for Large Raw Text Preprocessing and Model Training of Language Models at Scale. Master’s Thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.Google Scholar

Arora, C., Sabetzadeh, M., Briand, L. and Zimmer, F. (2016). Extracting domain models from natural-language requirements: Approach and industrial evaluation. In Proceedings of the ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems, pp. 250–260.CrossRef Google Scholar

Arun, R., Suresh, V. and Madhavan, C.V. (2009). Stopword graphs and authorship attribution in text corpora. In 2009 IEEE International Conference on Semantic Computing. IEEE, pp. 192–196.CrossRef Google Scholar

Ashok, A., Elmasri, R. and Natarajan, G. (2019). Comparing different word embeddings for multiword expression identification. In International Conference on Applications of Natural Language to Information Systems. Springer, pp. 295–302.Google Scholar

Attia, M. (2007). Arabic tokenization system. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pp. 65–72.CrossRef Google Scholar

Atwood, J. (2008). Stack Overflow Podcast Episode 32. Available from: https://stackoverflow.blog/2008/12/04/podcast-32/.Google Scholar

Au, T.C. (2014). Topics in Computational Advertising. PhD Dissertation, Duke University, Durham NC, USA.Google Scholar

Aye, T.T. (2011). Web log cleaning for mining of web usage patterns. In 2011 3rd International Conference on Computer Research and Development, vol. 2. IEEE, pp. 490–494.CrossRef Google Scholar

Ayral, H. and Yavuz, S. (2011). An automated domain specific stop word generation method for natural language text classification. In 2011 International Symposium on Innovations in Intelligent Systems and Applications. IEEE, pp. 500–503.CrossRef Google Scholar

Babar, S. and Patil, P.D. (2015). Improving performance of text summarization. Procedia Computer Science 46, 354–363.CrossRef Google Scholar

Baerman, M. (2015). The Oxford Handbook of Inflection. Oxford, UK: Oxford University Press.CrossRef Google Scholar

Baldwin, R.S. and Coady, J.M. (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Reading Behavior 10(4), 363–375.CrossRef Google Scholar

Baldwin, T. and Kim, S.N. (2010). Multiword expressions. Handbook of Natural Language Processing 2, 267–292.Google Scholar

Baldwin, T. and Li, Y. (2015). An in-depth analysis of the effect of text normalization in social media. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 420–429.CrossRef Google Scholar

Bao, Y., Quan, C., Wang, L. and Ren, F. (2014). The role of pre-processing in Twitter sentiment analysis. In International Conference on Intelligent Computing. Springer, pp. 615–624.Google Scholar

Bardoel, T. (2012). Comparing n-gram frequency distributions. Technical report, Tilburg center for Cognition and Communication (TiCC), Tilburg University, Tilburg, Netherlands.Google Scholar

Barr, C., Jones, R. and Regelson, M. (2008). The linguistic structure of English web-search queries. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 1021–1030.CrossRef Google Scholar

Barrett, N. and Weber-Jahnke, J. (2011). Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm. BMC Bioinformatics 12(3), S1.CrossRef Google Scholar PubMed

Barrón-Cedeño, A. and Rosso, P. (2009). On automatic plagiarism detection based on n-grams comparison. In European Conference on Information Retrieval. Springer, pp. 696–700.CrossRef Google Scholar

Barteld, F. (2017). Detecting spelling variants in non-standard texts. In Proceedings of the Student Research Workshop at the Fifteenth Conference of the European Chapter of the Association for Computational Linguistics, pp. 11–22.CrossRef Google Scholar

Basta, C., Costa-jussà, M.R. and Casas, N. (2021). Extensive study on the underlying gender bias in contextualized word embeddings. Neural Computing and Applications 33(8), 3371–3384.CrossRef Google Scholar

Batrinca, B. and Treleaven, P.C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & Society 30(1), 89–116.CrossRef Google Scholar

Battenberg, E. (2012). Ratings prediction using linear regression on text reviews. Technical report, Berkeley Institute of Design (BID), University of California – Berkeley, Berkeley CA, United States.Google Scholar

Beaufays, F. and Strope, B. (2013). Language model capitalization. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 6749–6752.CrossRef Google Scholar

Beaver, I. (2019). pycontractions 2.0.1. Python library. Available from: https://pypi.org/project/pycontractions/.Google Scholar

Beitzel, S.M., Jensen, E.C., Lewis, D.D., Chowdhury, A. and Frieder, O. (2007). Automatic classification of web queries using very large unlabeled query logs. ACM Transactions on Information Systems (TOIS) 25(2), 1–29.CrossRef Google Scholar

Bekkerman, R. and Allan, J. (2004). Using bigrams in text categorization. Technical report, IR-408, Center of Intelligent Information Retrieval, University of Massachusetts at Amherst, Amherst MA, United States.Google Scholar

Bender, E.M. (2011). On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology 6(3), 1–26.CrossRef Google Scholar

Bender, E.M. and Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, 587–604.CrossRef Google Scholar

Bengfort, B., Bilbro, R. and Ojeda, T. (2018). Applied Text Analysis with Python: Enabling Language-Aaware Data Products with Machine Learning. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar

Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155.Google Scholar

Benoit, K., Muhr, D. and Watanabe, K. (2017). stopwords: Multilingual Stopword Lists.

$\mathsf{R}$ package version 0.9.0. Available from: https://CRAN.R-project.org/package=stopwords.Google Scholar

Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S. and Matsuo, A. (2018). quanteda: An

$\mathsf{R}$ package for the quantitative analysis of textual data. Journal of Open Source Software 3(30), 774. Available from: https://quanteda.io.CrossRef Google Scholar

Berberich, K. and Bedathur, S. (2013). Computing n-gram statistics in MapReduce. In Proceedings of the 16th International Conference on Extending Database Technology, pp. 101–112.CrossRef Google Scholar

Berend, G. (2018). L1 regularization of word embeddings for multi-word expression identification. Acta Cybernetica 23(3), 801–813.CrossRef Google Scholar

Bergmanis, T. and Goldwater, S. (2018). Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1391–1400.CrossRef Google Scholar

Bernardy, J.-P. and Chatzikyriakidis, S. (2019). What kind of natural language inference are NLP systems learning: Is this enough? In ICAART (2), pp. 919–931. ICAART stands for International Conference on Agents and Artificial Intelligence.Google Scholar

Bhasuran, B., Murugesan, G., Abdulkadhar, S. and Natarajan, J. (2016). Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. Journal of Biomedical Informatics 64, 1–9.CrossRef Google Scholar PubMed

Bi, Y. (2016). Scheduling Optimization with LDA and Greedy Algorithm. Master’s Thesis, Duke University, Durham NC, United States. LDA stands for latent Dirichlet allocation.Google Scholar

Bird, S., Loper, E. and Klein, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. USA: O’Reilly Media Inc.Google Scholar

Blanchard, A. (2007). Understanding and customizing stopword lists for enhanced patent mapping. World Patent Information 29(4), 308–316.CrossRef Google Scholar

Blanco, E. and Moldovan, D. (2011). Some issues on detecting negation from text. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society (FLAIRS) Conference.Google Scholar

Blei, D.M. and Lafferty, J.D. (2009). Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013.Google Scholar

Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.Google Scholar

Bodapati, S., Yun, H. and Al-Onaizan, Y. (2019). Robustness to capitalization errors in named entity recognition. arXiv preprint arXiv:1911.05241.Google Scholar

Bokka, K.R., Hora, S., Jain, T. and Wambugu, M. (2019). Deep Learning for Natural Language Processing: Solve your Natural Language Processing Problems with Smart Deep Neural Networks. Birmingham, UK: Packt Publishing Ltd.Google Scholar

Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science 2(1), 1–8.CrossRef Google Scholar

Bollmann, M. (2013). POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 11–18.Google Scholar

Bollmann, M. (2019). A large-scale comparison of historical text normalization systems. arXiv preprint arXiv:1904.02036.Google Scholar

Bouchet-Valat, M. (2019). SnowballC: Snowball Stemmers Based on the C ‘libstemmer’ UTF-8 Library.

$\mathsf{R}$ package version 0.6.0. Available from: https://CRAN.R-project.org/package=SnowballC.Google Scholar

Boukobza, R. and Rappoport, A. (2009). Multi-word expression identification using sentence surface features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 468–477.CrossRef Google Scholar

Brants, T., Popat, A.C., Xu, P., Och, F.J. and Dean, J. (2007). Large language models in machine translation. Technical report, Google Research.Google Scholar

Briscoe, T. (1996). The syntax and semantics of punctuation and its use in interpretation. In Proceedings of the Association for Computational Linguistics Workshop on Punctuation, pp. 1–7.Google Scholar

Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8–13), 1157–1166. ISDN stands for Integrated Services Digital Network.CrossRef Google Scholar

Brooke, J., Hammond, A. and Hirst, G. (2015). GutenTag: An NLP-driven tool for digital humanities research in the Project Gutenberg corpus. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pp. 42–47.CrossRef Google Scholar

Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V. J.D. and Lai, J.C. (1992). Class-based n-gram models of natural language. Computational Linguistics 18(4), 467–479.Google Scholar

Buck, C., Heafield, K. and Van Ooyen, B. (2014). N-gram counts and language models from the common crawl. In LREC (International Conference on Language Resources and Evaluation), pp. 3579–3584.Google Scholar

Buerki, A. (2017). Frequency consolidation among word n-grams. In International Conference on Computational and Corpus-Based Phraseology. Springer, pp. 432–446.CrossRef Google Scholar

Cabot, C., Soualmia, L.F., Dahamna, B. and Darmoni, S.J. (2016). SIBM at CLEF eHealth Evaluation Lab 2016: Extracting concepts in French medical texts with ECMT and CIMIND. In CLEF (Working Notes), pp. 47–60. CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar

Cai, H., Yang, Y., Li, X. and Huang, Z. (2015). What are popular: Exploring Twitter features for event detection, tracking and visualization. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 89–98.CrossRef Google Scholar

Callan, J., Hoy, M., Yoo, C. and Zhao, L. (2009). The ClueWeb09 dataset. Available from: http://lemurproject.org/clueweb09/.Google Scholar

Camacho-Collados, J. and Pilehvar, M.T. (2018). On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 40–46. EMNLP stands for Conference on Empirical Methods in Natural Language Processing.Google Scholar

Cambria, E., Poria, S., Gelbukh, A. and Thelwall, M. (2017). Sentiment analysis is a big suitcase. IEEE Intelligent Systems 32(6), 74–80.CrossRef Google Scholar

Campbell, W.M., Li, L., Dagli, C., Acevedo-Aviles, J., Geyer, K., Campbell, J.P. and Priebe, C. (2016). Cross-domain entity resolution in social media. arXiv preprint arXiv:1608.01386.Google Scholar

Carvalho, G., de Matos, D.M. and Rocio, V. (2007). Document retrieval for question answering: A quantitative evaluation of text preprocessing. In Proceedings of the ACM First PhD Workshop in CIKM, pp. 125–130. CIKM stands for Conference on Information and Knowledge Management.CrossRef Google Scholar

Chai, C.P. (2017). Statistical Issues in Quantifying Text Mining Performance. PhD Dissertation, Duke University, Durham NC, USA.Google Scholar

Chai, C.P. (2019). Text mining in survey data. Survey Practice 12(1), 1–13.CrossRef Google Scholar

Chai, C.P. (2020). The importance of data cleaning: Three visualization examples. CHANCE 33(1), 4–9.CrossRef Google Scholar

Chai, C.P. (2022). Word distinctivity – Quantifying improvement of topic modeling results from n-gramming. REVSTAT–Statistical Journal 20(2), 199–220.Google Scholar

Chang, J.P., Chiam, C., Fu, L., Wang, A.Z., Zhang, J. and Danescu-Niculescu-Mizil, C. (2020). Convokit: A toolkit for the analysis of conversations. arXiv preprint arXiv:2005.04246.Google Scholar

Chatterjee, S. and Krystyanczuk, M. (2017). Python Social Media Analytics. Birmingham, UK: Packt Publishing Ltd. Book excerpt. Available from: https://hub.packtpub.com/clean-social-media-data-analysis-python/.Google Scholar

Chaudhary, G. and Kshirsagar, M. (2018). Overview and application of text data pre-processing techniques for text mining on health news Tweets. Helix 8(5), 3764–3768.CrossRef Google Scholar

Chen, Q., Yao, L. and Yang, J. (2016). Short text classification based on LDA topic model. In 2016 International Conference on Audio, Language and Image Processing (ICALIP). IEEE, pp. 749–753. LDA stands for latent Dirichlet allocation.CrossRef Google Scholar

Chen, Y., Zhang, H., Liu, R., Ye, Z. and Lin, J. (2019). Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowledge-Based Systems 163, 1–13. LDA stands for latent Dirichlet allocation, and NMF stands for non-negative matrix factorization. Google Scholar

Chomsky, N. (1969). Deep Structure, Surface Structure, and Semantic Interpretation. Indiana University Linguistics Club.Google Scholar

Chua, M., Van Esch, D., Coccaro, N., Cho, E., Bhandari, S. and Jia, L. (2018). Text normalization infrastructure that scales to hundreds of language varieties. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).Google Scholar

Cirqueira, D., Pinheiro, M.F., Jacob, A., Lobato, F. and Santana, Á. (2018). A literature review in preprocessing for sentiment analysis for Brazilian Portuguese social media. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, pp. 746–749.CrossRef Google Scholar

Clark, E. and Araki, K. (2011). Text normalization in social media: Progress, problems and applications for a pre-processing system of casual English. Procedia-Social and Behavioral Sciences 27, 2–11.CrossRef Google Scholar

Clough, P. (2001). A Perl program for sentence splitting using rules. Technical report, University of Sheffield, Sheffield, United Kingdom.Google Scholar

Cohen, K.B., Acquaah-Mensah, G.K., Dolbey, A.E. and Hunter, L. (2002). Contrast and variability in gene names. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, vol. 3. Association for Computational Linguistics (ACL), pp. 14–20.CrossRef Google Scholar

Cohen, K.B., Hunter, L.E. and Pressman, P.S. (2019). P-hacking lexical richness through definitions of “type” and “token”. Studies in Health Technology and Informatics 264, 1433–1434.Google Scholar PubMed

Cohen, K.B., Ogren, P.V., Fox, L. and Hunter, L. (2005). Empirical data on corpus design and usage in biomedical natural language processing. In American Medical Informatics Association (AMIA) Annual Symposium Proceedings, vol. 2005, p. 156.Google Scholar

Cohen, K.B., Roeder, C., Baumgartner, W. A. Jr, Hunter, L.E. and Verspoor, K. (2010). Test suite design for ontology concept recognition systems, pp. 441–446.Google Scholar

Cohen, K.B., Tanabe, L., Kinoshita, S. and Hunter, L. (2004). A resource for constructing customized test suites for molecular biology entity identification systems. In HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases, pp. 1–8. HLT-NAACL stands for Human Language Technologies – North American Chapter of the Association for Computational Linguistics.Google Scholar

Congjun, L. and Hill, N.W. (2021). Recent developments in Tibetan NLP. Transactions on Asian and Low-Resource Language Information Processing 20(2), 1–3.CrossRef Google Scholar

Constant, M., Eryiğit, G., Monti, J., Van Der Plas, L., Ramisch, C., Rosner, M. and Todirascu, A. (2017). Multiword expression processing: A survey. Computational Linguistics 43(4), 837–892.CrossRef Google Scholar

Corbett, P. and Boyle, J. (2018). Improving the learning of chemical-protein interactions from literature using transfer learning and specialized word embeddings. Database 2018, 1–10.CrossRef Google Scholar PubMed

Corral, Á., Boleda, G. and Ferrer-i Cancho, R. (2015). Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PLOS One 10(7), e0129031.CrossRef Google Scholar PubMed

Councill, I., McDonald, R. and Velikovich, L. (2010). What’s great and what’s not: Learning to classify the scope of negation for improved sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 51–59.Google Scholar

Craiger, P. and Shenoi, S. (2007). Advances in Digital Forensics III: IFIP International Conference on Digital Forensics, National Center for Forensic Science, Orlando Florida, 28–31 January 2007, vol. 242. Springer. IFIP stands for International Federation for Information Processing.CrossRef Google Scholar

Craven, T.C. (2004). Variations in use of meta tag keywords by web pages in different languages. Journal of Information Science 30(3), 268–279.CrossRef Google Scholar

CrowdFlower (2017). 2017 Data Science Report. Available from: https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport.pdf.Google Scholar

Crystal, D. (1999). The Penguin Dictionary of Language. London, UK: Penguin Group.Google Scholar

Cutrone, L. and Chang, M. (2011). Auto-assessor: Computerized assessment system for marking student’s short-answers automatically. In 2011 IEEE International Conference on Technology for Education. IEEE, pp. 81–88.CrossRef Google Scholar

Dařena, F. (2019). VecText: Converting documents to vectors. IAENG International Journal of Computer Science 46(2). IAENG stands for International Association of Engineers. Google Scholar

Das, B., Pal, S., Mondal, S.K., Dalui, D. and Shome, S.K. (2013). Automatic keyword extraction from any text document using N-gram rigid collocation. International Journal of Soft Computing and Engineering (IJSCE) 3(2), 238–242.Google Scholar

Denny, M.J. and Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis 26(2), 168–189.CrossRef Google Scholar

Deoras, A., Mikolov, T. and Church, K. (2011). A fast re-scoring strategy to capture long-distance dependencies. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1116–1127.Google Scholar

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.Google Scholar

Daz, N.P.C. and López, M.J.M. (2015). An analysis of biomedical tokenization: Problems and strategies. In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, pp. 40–49.CrossRef Google Scholar

Dinov, I.D. (2018). Data Science and Predictive Analytics: Biomedical and Health Applications using

$\mathsf{R}$ . Ann Arbor, MI, USA: Springer.CrossRef Google Scholar

Dodge, J., Gururangan, S., Card, D., Schwartz, R. and Smith, N.A. (2019). Show your work: Improved reporting of experimental results. arXiv preprint arXiv:1909.03004.Google Scholar

Domingo, M., Garca-Martnez, M., Helle, A., Casacuberta, F. and Herranz, M. (2018). How much does tokenization affect neural machine translation? arXiv preprint arXiv:1812.08621.Google Scholar

Doraisamy, S. and Rüger, S. (2003). Robust polyphonic music retrieval with n-grams. Journal of Intelligent Information Systems 21(1), 53–70.CrossRef Google Scholar

Dressel, F. (2016). Distribution of n-grams in English text corpus. Available from: http://rpubs.com/fdd/187848.Google Scholar

Dror, R., Baumer, G., Shlomov, S. and Reichart, R. (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the Fifty-Sixth Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1383–1392.CrossRef Google Scholar

Du, J., Yu, P. and Zong, C. (2018). Towards computing technologies on machine parsing of English and Chinese garden path sentences. In Proceedings of the Future Technologies Conference. Springer, pp. 806–827.Google Scholar

Duan, J., Lu, R., Wu, W., Hu, Y. and Tian, Y. (2006). A bio-inspired approach for multi-word expression extraction. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics, pp. 176–182. COLING stands for International Conference on Computational Linguistics.CrossRef Google Scholar

Dunning, T. (1994). Statistical identification of language. In Computing Research Laboratory Technical Memo MCCS 94-273. New Mexico State University.Google Scholar

Duran, M.S., Avanço, L., Alusio, S., Pardo, T. and Nunes, M.d.G.V. (2014). Some issues on the normalization of a corpus of products reviews in Portuguese. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 22–28.CrossRef Google Scholar

Effrosynidis, D., Symeonidis, S. and Arampatzis, A. (2017). A comparison of pre-processing techniques for Twitter sentiment analysis. In International Conference on Theory and Practice of Digital Libraries. Springer, pp. 394–406.CrossRef Google Scholar

Ek, A., Bernardy, J.-P. and Chatzikyriakidis, S. (2020). How does punctuation affect neural models in natural language inference. In Proceedings of the Probability and Meaning Conference (PaM 2020), pp. 109–116.Google Scholar

El-Khair, I.A. (2017). Effects of stop words elimination for Arabic information retrieval: A comparative study. arXiv preprint arXiv:1702.01925.Google Scholar

Elming, J., Johannsen, A., Klerke, S., Lapponi, E., Alonso, H.M. and Søgaard, A. (2013). Down-stream effects of tree-to-dependency conversions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 617–626.Google Scholar

Fan, A., Doshi-Velez, F. and Miratrix, L. (2017). Promoting domain-specific terms in topic models with informative priors. arXiv preprint arXiv:1701.03227.Google Scholar

Fancellu, F., Lopez, A. and Webber, B. (2016). Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 495–504.CrossRef Google Scholar

Fancellu, F., Lopez, A., Webber, B. and He, H. (2017). Detecting negation scope is easy, except when it isn’t. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 58–63.CrossRef Google Scholar

Farzindar, A.A. and Inkpen, D. (2020). Natural language processing for social media. Synthesis Lectures on Human Language Technologies 13(2), 1–219.CrossRef Google Scholar

Fatyanosa, T.N. and Bachtiar, F.A. (2017). Classification method comparison on Indonesian social media sentiment analysis. In 2017 International Conference on Sustainable Information Engineering and Technology (SIET). IEEE, pp. 310–315.CrossRef Google Scholar

Feinerer, I. and Hornik, K. (2018). tm: Text Mining Package.

$\mathsf{R}$ package version 0.7.6. Available from: https://CRAN.R-project.org/package=tm.Google Scholar

Ferilli, S., Esposito, F. and Grieco, D. (2014). Automatic learning of linguistic resources for stopword removal and stemming from text. Procedia Computer Science 38, 116–123.CrossRef Google Scholar

Ferret, O., Grau, B., Hurault-Plantet, M., Illouz, G., Jacquemin, C., Monceaux, L. and Vilnat, A. (2002). How NLP can improve question answering. Knowledge Organization 29(3–4), 135–155.Google Scholar

Fesseha, A., Xiong, S., Emiru, E.D., Diallo, M. and Dahou, A. (2021). Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Information 12(2), 52.CrossRef Google Scholar

Finlayson, M. and Kulkarni, N. (2011). Detecting multi-word expressions improves word sense disambiguation. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, pp. 20–24.Google Scholar

Fisher, N. (2013). Analytics for Leaders: A Performance Measurement System for Business Success. Cambridge, UK: Cambridge University Press.CrossRef Google Scholar

Fisher, N. (2019). A comprehensive approach to problems of performance measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society) 182(3), 755–803.CrossRef Google Scholar

Fisher, N. and Lee, A. (2011). Getting the ‘correct’ answer from survey responses: A simple application of the EM algorithm. Australian & New Zealand Journal of Statistics 53(3), 353–364.CrossRef Google Scholar

Forst, M. and Kaplan, R.M. (2006). The importance of precise tokenizing for deep grammars. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06).Google Scholar

Fosler-Lussier, E. (1998). Markov models and hidden Markov models: A brief tutorial. International Computer Science Institute.Google Scholar

Foster, J., Cetinoglu, O., Wagner, J., Le Roux, J., Hogan, S., Nivre, J., Hogan, D., and Van Genabith, J. 2011. # hardtoparse: POS tagging and parsing the Twitterverse. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence. AAAI stands for Association for the Advancement of Artificial Intelligence.Google Scholar

Fowler, F. J. 1995. Improving Survey Questions: Design and Evaluation, vol. 38. Thousand Oaks, CA, USA: Sage.Google Scholar

Fox, C. 1989. A stop list for general text. In ACM SIGIR Forum, vol. 24, pp. 19–21. ACM. SIGIR stands for Special Interest Group on Information Retrieval.CrossRef Google Scholar

Friedman, H.H. and Amoo, T. (1999). Rating the rating scales. Journal of Marketing Management 9(3), 114–123.Google Scholar

Frigau, L., Wu, Q. and Banks, D. (2021). Optimizing the JSM program. Journal of the American Statistical Association 00(0), 1–10. JSM stands for Joint Statistical Meetings.Google Scholar

Fundel, K., Küffner, R. and Zimmer, R. (2007). RelEx – relation extraction using dependency parse trees. Bioinformatics 23(3), 365–371.CrossRef Google Scholar PubMed

Gabernet, A.R. and Limburn, J. (2017). Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity. Available from: https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity.Google Scholar

Gamallo, P., Campos, J.R.P. and Alegria, I. (2017). A perplexity-based method for similar languages discrimination. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 109–114.CrossRef Google Scholar

Garíca, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M. and Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics 1(1), 1–22.Google Scholar

Gentzkow, M., Kelly, B.T. and Taddy, M. (2017). Text as data. Technical report, National Bureau of Economic Research, Cambridge, MA, USA.CrossRef Google Scholar

Gerlach, M., Shi, H. and Amaral, L.A.N. (2019). A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence 1(12), 606–612.CrossRef Google Scholar

Gerz, D., Vulić, I., Ponti, E., Naradowsky, J., Reichart, R. and Korhonen, A. (2018). Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics 6, 451–465.CrossRef Google Scholar

Ghosh, S., Johansson, R., Riccardi, G. and Tonelli, S. (2011). Shallow discourse parsing with conditional random fields. In Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 1071–1079.Google Scholar

Globerson, A. and Roweis, S. (2006). Nightmare at test time: Robust learning by feature deletion. In Proceedings of the Twenty-Third International Conference on Machine Learning. ACM, pp. 353–360.CrossRef Google Scholar

Goldberg, Y. and Orwant, J. (2013). A dataset of syntactic-ngrams over time from a very large corpus of English books. Technical report, Google Research.Google Scholar

Gomes, F.B., Adán-Coello, J.M. and Kintschner, F.E. (2018). Studying the effects of text preprocessing and ensemble methods on sentiment analysis of Brazilian Portuguese Tweets. In International Conference on Statistical Language and Speech Processing. Springer, pp. 167–177.CrossRef Google Scholar

Goodman, E.L., Zimmerman, C. and Hudson, C. (2020). Packet2vec: Utilizing word2vec for feature extraction in packet data. arXiv preprint arXiv:2004.14477.Google Scholar

Gorrell, G., Petrak, J. and Bontcheva, K. (2015). Using @Twitter conventions to improve #LOD-based named entity disambiguation. In European Semantic Web Conference. Springer, pp. 171–186. LOD stands for Linked Open Data.CrossRef Google Scholar

Goyvaerts, J. and Levithan, S. (2012). Regular Expressions Cookbook. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar

Grabar, N., Zweigenbaum, P., Soualmia, L. and Darmoni, S. (2003). Matching controlled vocabulary words. In Studies in Health Technology and Informatics, pp. 445–450.Google Scholar

Grana, J., Alonso, M.A. and Vilares, M. (2002). A common solution for tokenization and part-of-speech tagging. In International Conference on Text, Speech and Dialogue, vol. 2448. Springer, pp. 3–10.Google Scholar

Grefenstette, G. and Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenisation. Technical report, Rank Xerox Research Centre, Grenoble Laboratory, Meylan, France.Google Scholar

Gries, S.T. and Mukherjee, J. (2010). Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics 15(4), 520–548. ICE stands for International Corpus of English.CrossRef Google Scholar

Grishman, R. (2015). Information extraction. IEEE Intelligent Systems 30(5), 8–15.CrossRef Google Scholar

Grobelnik, M. and Mladenic, D. (2004). Text-mining tutorial. Technical report, Jožef Stefan Institute (JSI), Slovenia.Google Scholar

Grön, L. and Bertels, A. (2018). Clinical sublanguages: Vocabulary structure and its impact on term weighting. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication 24(1), 41–65.Google Scholar

Groza, T. and Verspoor, K. (2014). Automated generation of test suites for error analysis of concept recognition systems. In Proceedings of the Australasian Language Technology Association Workshop 2014, pp. 23–31.Google Scholar

Guthrie, D., Allison, B., Liu, W., Guthrie, L. and Wilks, Y. (2006). A closer look at skip-gram modelling. In LREC (International Conference on Language Resources and Evaluation), vol. 6, pp. 1222–1225.Google Scholar

Ha, L., Hanna, P., Ming, J. and Smith, F. (2009). Extending Zipf’s law to n-grams for large corpora. Artificial Intelligence Review 32(1–4), 101–113.CrossRef Google Scholar

Habert, B., Adda, G., Adda-Decker, M., de Marëuil, P.B., Ferrari, S., Ferret, O., Illouz, G. and Paroubek, P. (1998). Towards tokenization evaluation. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pp. 427–431.Google Scholar

HaCohen-Kerner, Y., Miller, D. and Yigal, Y. (2020). The influence of preprocessing on text classification using a bag-of-words representation. PLOS One 15(5), e0232525.CrossRef Google Scholar PubMed

Haddi, E., Liu, X. and Shi, Y. (2013). The role of text pre-processing in sentiment analysis. Procedia Computer Science 17, 26–32.CrossRef Google Scholar

Hajba, G.L. (2018). Using beautiful soup. In Website Scraping with Python. Springer, pp. 41–96.CrossRef Google Scholar

Hansen, C., Hansen, C., Simonsen, J.G. and Lioma, C. (2018). The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab. In CLEF (Working Notes). CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar

Hartrumpf, S., Helbig, H. and Osswald, R. (2006). Semantic interpretation of prepositions for NLP applications. In Proceedings of the Third ACL-SIGSEM Workshop on Prepositions.CrossRef Google Scholar

He, W., Zha, S. and Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management 33(3), 464–472.CrossRef Google Scholar

Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. and Jennions, M.D. (2015). The extent and consequences of p-hacking in science. PLOS Biology 13(3), e1002106.CrossRef Google Scholar PubMed

Hedderich, M.A., Lange, L., Adel, H., Strötgen, J. and Klakow, D. (2020). A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint arXiv:2010.12309.Google Scholar

Heimerl, F., Lohmann, S., Lange, S. and Ertl, T. (2014). Word cloud explorer: Text analytics based on word clouds. In 2014 Forty-Seventh Hawaii International Conference on System Sciences. IEEE, pp. 1833–1842.CrossRef Google Scholar

Henry, T. (2016). Quick ngram processing script. Available from: https://github.com/trh3/NGramProcessing.Google Scholar

Hernández, M.A. and Stolfo, S.J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 9–37.CrossRef Google Scholar

Hickman, L., Thapa, S., Tay, L., Cao, M. and Srinivasan, P. (2020). Text preprocessing for text mining in organizational research: Review and recommendations. In Organizational Research Methods, pp. 1–58.Google Scholar

Hiraoka, T., Shindo, H. and Matsumoto, Y. (2019). Stochastic tokenization with a language model for neural text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1620–1629.CrossRef Google Scholar

Hoffman, B. (1996). Translating into free word order languages. In Proceedings of the 16th Conference on Computational Linguistics-Volume 1, pp. 556–561.CrossRef Google Scholar

Hofherr, P.C. (2012). Preposition-determiner amalgams in German and French at the syntax-morphology interface. In Ackema, P. (ed), Comparative Germanic Syntax: The State of the Art, Amsterdam, Netherlands. John Benjamins Publishing Company, pp. 99–132.CrossRef Google Scholar

Houvardas, J. and Stamatatos, E. (2006). N-gram feature selection for authorship identification. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications. Springer, pp. 77–86.CrossRef Google Scholar

Huang, H. and Zhang, B. (2009). Text segmentation. In Liu, L. and Özsu, M.T. (eds), Encyclopedia of Database Systems, vol. 6. Springer, pp. 3072–3075.CrossRef Google Scholar

Huston, S., Culpepper, J.S. and Croft, W.B. (2014). Indexing word sequences for ranked retrieval. ACM Transactions on Information Systems (TOIS) 32(1), 1–26.CrossRef Google Scholar

Huston, S., Moffat, A. and Croft, W.B. (2011). Efficient indexing of repeated n-grams. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 127–136.CrossRef Google Scholar

Hutto, C. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 8, pp. 216–225. AAAI stands for Association for the Advancement of Artificial Intelligence.CrossRef Google Scholar

Hvitfeldt, E. and Silge, J. (2021). Supervised Machine Learning for Text Analysis in

$\mathsf{R}$ . New York, NY, USA: Chapman and Hall/CRC.CrossRef Google Scholar

Indurkhya, N. and Damerau, F.J. (2010). Handbook of Natural Language Processing. New York, NY, USA: Chapman and Hall/CRC.CrossRef Google Scholar

Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D. and Rayes, A. (2015). A survey on text mining in social networks. The Knowledge Engineering Review 30(2), 157–170.CrossRef Google Scholar

Jean-Baptiste, E. (1916). Gammes sténographiques. Technical report, Institut Sténographique de France, Paris.Google Scholar

Ježek, E. (2016). The Lexicon: An Introduction. Oxford, UK: Oxford University Press.Google Scholar

Ji, Y. and Eisenstein, J. (2014). Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 13–24.CrossRef Google Scholar

Ji, Z., Wei, Q. and Xu, H. (2020). BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020, 269–277. AMIA stands for American Medical Informatics Association.Google Scholar

Jiang, J. and Zhai, C. (2007). An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval 10(4–5), 341–363.CrossRef Google Scholar

Jiang, Z., Li, L., Huang, D. and Jin, L. (2015). Training word embeddings for deep learning in biomedical text mining tasks. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 625–628.CrossRef Google Scholar

Jimenez, M., Maxime, C., Le Traon, Y. and Papadakis, M. (2018). On the impact of tokenizer and parameters on n-gram based code analysis. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, pp. 437–448.CrossRef Google Scholar

Jivani, A.G. (2011). A comparative study of stemming algorithms. International Journal of Computer Technology and Applications (IJCTA) 2(6), 1930–1938.Google Scholar

Jones, B.E. (1994). Exploring the role of punctuation in parsing natural text. In Proceedings of the Fifteenth Conference on Computational Linguistics, vol. 1. Association for Computational Linguistics, pp. 421–425.CrossRef Google Scholar

Jones, R. (2006). Internet Slang Dictionary. Durham NC, USA: Lulu.com.Google Scholar

Joty, S., Carenini, G., Ng, R. and Murray, G. (2019). Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 12–17.CrossRef Google Scholar

Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.Google Scholar

Jurafsky, D. and Martin, J.H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd Edn., Fort Collins, CO, USA: Prentice Hall Series in Artificial Intelligence.Google Scholar

Jusoh, S. (2018). A study on NLP applications and ambiguity problems. Journal of Theoretical & Applied Information Technology 96(6), 1486–1499.Google Scholar

Kadhim, A.I. (2018). An evaluation of preprocessing techniques for text classification. International Journal of Computer Science and Information Security (IJCSIS) 16(6), 22–32.Google Scholar

Kaewphan, S., Mehryary, F., Hakala, K., Salakoski, T. and Ginter, F. (2017). TurkuNLP entry for interactive Bio-ID assignment. In Proceedings of the BioCreative VI Workshop, pp. 32–35.Google Scholar

Kaiser, E. and Trueswell, J.C. (2004). The role of discourse context in the processing of a flexible word-order language. Cognition 94(2), 113–147.CrossRef Google Scholar PubMed

Kalra, V. and Aggarwal, R. (2018). Importance of text data preprocessing & implementation in RapidMiner. In Proceedings of the First International Conference on Information Technology and Knowledge Management (ICITKM), vol. 14, pp. 71–75.Google Scholar

Kamps, J., Adafre, S.F. and De Rijke, M. (2004). Effective translation, tokenization and combination for cross-lingual retrieval. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, pp. 123–134.Google Scholar

Kannan, S. and Gurusamy, V. (2014). Preprocessing techniques for text mining. International Journal of Computer Science & Communication Networks 5(1), 7–16.Google Scholar

Kao, A. and Poteet, S.R. (2007). Natural Language Processing and Text Mining. London, UK: Springer Science & Business Media.CrossRef Google Scholar

Karthikeyan, S., Jotheeswaran, J., Balamurugan, B. and Chatterjee, J.M. (2020). Text mining. In Natural Language Processing in Artificial Intelligence, pp. 167–210.CrossRef Google Scholar

Kathuria, A., Gupta, A. and Singla, R. (2021). A review of tools and techniques for preprocessing of textual data. In Computational Methods and Data Engineering, pp. 407–422.CrossRef Google Scholar

Kaufmann, M. and Kalita, J. (2010). Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India. Google Scholar

Kaur, J. and Buttar, P.K. (2018). A systematic review on stopword removal algorithms. International Journal on Future Revolution in Computer Science & Communication Engineering 4(4), 207–210.Google Scholar

Kaviani, M. and Rahmani, H. (2020). EmHash: Hashtag recommendation using neural network based on BERT embedding. In 2020 6th International Conference on Web Research (ICWR). IEEE, pp. 113–118.CrossRef Google Scholar

Khyani, D., Siddhartha, B., Niveditha, N. and Divya, B. (2020). An interpretation of lemmatization and stemming in natural language processing. Journal of University of Shanghai for Science and Technology 22(10), 350–357.Google Scholar

Kim, K.H. and Zhu, Y. (2019). Researching Translation in the Age of Technology and Global Conflict: Selected Works of Mona Baker. Abingdon, UK: Routledge.CrossRef Google Scholar

Kiss, T. and Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4), 485–525.CrossRef Google Scholar

Kitaev, N., Cao, S. and Klein, D. (2019). Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760.Google Scholar

Koehn, P., Arun, A. and Hoang, H. (2008). Towards better machine translation quality for the German-English language pairs. In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 139–142.CrossRef Google Scholar

Korde, V. and Mahender, C.N. (2012). Text classification and classifiers: A survey. International Journal of Artificial Intelligence & Applications 3(2), 85.CrossRef Google Scholar

Korenius, T., Laurikkala, J., Järvelin, K. and Juhola, M. (2004). Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633.CrossRef Google Scholar

Koto, F. and Adriani, M. (2015). A comparative study on Twitter sentiment analysis: Which features are good? In International Conference on Applications of Natural Language to Information Systems. Springer, pp. 453–457.Google Scholar

Koulali, R. and Meziane, A. (2011). Topic detection and multi-word terms extraction for Arabic unvowelized documents. In Asia Information Retrieval Symposium. Springer, pp. 614–623.CrossRef Google Scholar

Kozakou, E. (2017). Word Adaptions in the Language of Twitter. Master’s Thesis, Leiden University, Leiden, Netherlands.Google Scholar

Kudo, T. and Richardson, J. (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.Google Scholar

Kulkarni, A. and Shivananda, A. (2019). Natural Language Processing Recipes. Berkeley, CA, USA: Springer.CrossRef Google Scholar

Kunilovskaya, M. and Plum, A. (2021). Text preprocessing and its implications in a digital humanities project. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pp. 85–93. RANLP stands for International Conference on Recent Advances in Natural Language Processing.CrossRef Google Scholar

Kutuzov, A., Fares, M., Oepen, S. and Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 58th Conference on Simulation and Modelling, Linköping, Sweden. Linköping University Electronic Press, pp. 271–276.Google Scholar

Kwak, H., Lee, C., Park, H. and Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the Nineteenth International Conference on World Wide Web. ACM, pp. 591–600.Google Scholar

Kwartler, T. (2017). Text Mining in Practice with

$\mathsf{R}$ . Hoboken, NJ, USA: John Wiley & Sons.CrossRef Google Scholar

Labusch, M., Kotz, S.A. and Perea, M. (2022). The impact of capitalized German words on lexical access. Psychological Research 2022(86), 891–902.CrossRef Google Scholar

Lafferty, J.D. and Blei, D.M. (2006). Correlated topic models. In Advances in Neural Information Processing Systems, pp. 147–154.Google Scholar

Lahiri, S. and Mihalcea, R. (2013). Using n-gram and word network features for native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 251–259.Google Scholar

Lambert, N.J. (2017). Text mining tutorial. In Group Processes. Edinburgh, Scotland, UK: Springer, Cham, pp. 93–117.CrossRef Google Scholar

Lambert, P. and Banchs, R.E. (2006). Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-Word-Expressions in a Multilingual Context.Google Scholar

Lan, Y. (1996). Chomskian deep structure and translation. Perspectives: Studies in Translatology 4(1), 103–113.CrossRef Google Scholar

Lazarinis, F. (2007). Evaluating the searching capabilities of e-commerce web sites in a non-English language. Online Information Review 31(6), 881–891.CrossRef Google Scholar

Le, H.P. and Ho, T.V. (2008). A maximum entropy approach to sentence boundary detection of Vietnamese texts. In IEEE International Conference on Research, Innovation and Vision for the Future-RIVF 2008.Google Scholar

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H. and Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240.CrossRef Google Scholar PubMed

Leek, J.T. and Jager, L.R. (2017). Is most published research really false? Annual Review of Statistics and Its Application 4, 109–122.CrossRef Google Scholar

Lende, S.P. and Raghuwanshi, M. (2016). Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave). IEEE, pp. 1–6.CrossRef Google Scholar

Lextek International n.d. Onix text retrieval toolkit – API reference. Available from: https://www.lextek.com/manuals/onix/ [last accessed August 2019].Google Scholar

Li, C., Duan, Y., Wang, H., Zhang, Z., Sun, A. and Ma, Z. (2017). Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Transactions on Information Systems (TOIS) 36(2), 1–30.CrossRef Google Scholar

Li, C., Wang, H., Zhang, Z., Sun, A. and Ma, Z. (2016). Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the Thirty-Ninth International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 165–174.CrossRef Google Scholar

Li, L., Shao, Y., Song, D., Qiu, X. and Huang, X. (2020). Generating adversarial examples in Chinese texts using sentence-pieces. arXiv preprint arXiv:2012.14769.Google Scholar

Li, X. and Croft, W.B. (2001). Incorporating syntactic information in question answering. Technical report, Center for Intelligent Information Retrieval, University of Massachusetts at Amherst, Amherst MA, USA.CrossRef Google Scholar

Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157.CrossRef Google Scholar

Lin, J., Nogueira, R. and Yates, A. (2020). Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467.Google Scholar

Lita, L.V., Ittycheriah, A., Roukos, S. and Kambhatla, N. (2003). tRuEcasIng. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 152–159.CrossRef Google Scholar

Litvak, M. and Vanetik, N. (2019). Multilingual Text Analysis: Challenges, Models, and Approaches. Singapore: World Scientific.CrossRef Google Scholar

Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge, UK: Cambridge University Press.CrossRef Google Scholar

Liu, H., Christiansen, T., Baumgartner, W.A. and Verspoor, K. (2012). BioLemmatizer: A lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics 3(1), 1–29.CrossRef Google Scholar PubMed

Liu, L. and Özsu, M.T. (2009). Encyclopedia of Database Systems, vol. 6. New York, NY, USA: Springer.CrossRef Google Scholar

Liu, Z. and Jansen, B.J. (2013). Factors influencing the response rate in social question and answering behavior. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 1263–1274.CrossRef Google Scholar

Lo, R.T.-W., He, B. and Ounis, I. (2005). Automatically building a stopword list for an information retrieval system. In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), vol. 5, pp. 17–24.Google Scholar

Loginova, E., Varanasi, S. and Neumann, G. (2018). Towards multilingual neural question answering. In European Conference on Advances in Databases and Information Systems. Springer, pp. 274–285.Google Scholar

Los, B. and Lubbers, T. (2019). Syntax, text type, genre and authorial voice in old English: A data-driven approach. In Grammar–Discourse–Context: Grammar and Usage in Language Variation and Change, vol. 23, p. 49.CrossRef Google Scholar

Lourentzou, I., Manghnani, K. and Zhai, C. (2019). Adapting sequence to sequence models for text normalization in social media. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, pp. 335–345. AAAI stands for Association for the Advancement of Artificial Intelligence.CrossRef Google Scholar

Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11(1–2), 22–31.Google Scholar

Lüdeling, A. and Kytö, M. (2008). Corpus Linguistics, vol. 1. Berlin, Germany: Walter de Gruyter.CrossRef Google Scholar

Luhn, H.P. (1959). Keyword-in-context index for technical literature (KWIC index). American Documentation 11(4), 288–295.CrossRef Google Scholar

Luiz, O.J., Olden, J.D., Kennard, M.J., Crook, D.A., Douglas, M.M., Saunders, T.M. and King, A.J. (2019). Trait-based ecology of fishes: A quantitative assessment of literature trends and knowledge gaps using topic modelling. Fish and Fisheries 20(6), 1100–1110.CrossRef Google Scholar

Luo, J., Tinsley, J. and Lepage, Y. (2013). Exploiting parallel corpus for handling out-of-vocabulary words. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27), pp. 399–408.Google Scholar

Lusetti, M., Ruzsics, T., Göhring, A., Samardžić, T. and Stark, E. (2018). Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, pp. 18–28.Google Scholar

Makrehchi, M. and Kamel, M.S. (2008). Automatic extraction of domain-specific stopwords from labeled documents. In European Conference on Information Retrieval. Springer, pp. 222–233.CrossRef Google Scholar

Makrehchi, M. and Kamel, M.S. (2017). Extracting domain-specific stopwords for text classifiers. Intelligent Data Analysis 21(1), 39–62.CrossRef Google Scholar

Malvern, D. and Richards, B. (2012). Measures of lexical richness. In The Encyclopedia of Applied Linguistics.CrossRef Google Scholar

Manning, C., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press.CrossRef Google Scholar

Mansurov, B. and Mansurov, A. (2021). Uzbek Cyrillic-Latin-Cyrillic machine transliteration. arXiv preprint arXiv:2101.05162.Google Scholar

Marcu, D. (2000). The Theory and Practice of Discourse Parsing and Summarization. Cambridge, MA, USA: MIT Press.CrossRef Google Scholar

Masini, F. (2005). Multi-word expressions between syntax and the lexicon: The case of Italian verb-particle constructions. SKY Journal of Linguistics 18(2005), 145–173. SKY stands for Suomen kielitieteellinen yhdistys, from the Linguistic Association of Finland. Google Scholar

Masini, F. (2019). Multi-word expressions and morphology. In Oxford Research Encyclopedia of Linguistics.CrossRef Google Scholar

Matusov, E., Leusch, G., Bender, O. and Ney, H. (2005). Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation (IWSLT).Google Scholar

McAuliffe, J.D. and Blei, D.M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121–128.Google Scholar

McNamee, P. and Mayfield, J. (2007). N-gram morphemes for retrieval. In CLEF (Working Notes). CLEF stands for the Cross-Language Evaluation Forum workshop.Google Scholar

Mendoza, M., Poblete, B. and Castillo, C. (2010). Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics. ACM, pp. 71–79.Google Scholar

Merlo, P. (2019). Probing word and sentence embeddings for long-distance dependencies effects in French and English. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 158–172.CrossRef Google Scholar

Metzler, H., Baginski, H., Niederkrotenthaler, T. and Garcia, D. (2021). Detecting potentially harmful and protective suicide-related content on Twitter: A machine learning approach. arXiv preprint arXiv:2112.04796.Google Scholar

Meyer, D., Hornik, K. and Feinerer, I. (2008). Text mining infrastructure in

$\mathsf{R}$ . Journal of Statistical Software 25(5), 1–54.Google Scholar

Miah, M. (2009). Improved k-NN algorithm for text classification. In Proceedings of the 2009 International Conference on Data Mining (DMIN). Citeseer, pp. 434–440.Google Scholar

Microsoft (2019). Extract n-gram features from text. Available from: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/extract-n-gram-features-from-text.Google Scholar

Mieke, S.S. (2016). Language diversity in ACL 2004–2016. ACL stands for the annual meeting of the Association for Computational Linguistics. Available from: https://sjmielke.com/acl-language-diversity.htm.Google Scholar

Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.Google Scholar

Miller, J.-A. (2014). Language: Much ado about what? In Lacan and the Subject of Language, pp. 21–35.Google Scholar

Mitrofan, M. and Tufiş, D. (2018). Bioro: The biomedical corpus for the Romanian language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).Google Scholar

Moh, T.-S. and Zhang, Z. (2012). Cross-lingual text classification with model translation and document translation. In Proceedings of the 50th Annual Southeast Regional Conference, pp. 71–76.CrossRef Google Scholar

Moon, S. and Okazaki, N. (2020). Jamo pair encoding: Subcharacter representation-based extreme Korean vocabulary compression for efficient subword tokenization. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3490–3497.Google Scholar

Moon, S. and Okazaki, N. (2021). Effects and mitigation of out-of-vocabulary in universal language models. Journal of Information Processing 29, 490–503.CrossRef Google Scholar

Morante, R. and Blanco, E. (2021). Recent advances in processing negation. Natural Language Engineering 27(2), 1–10.Google Scholar

Morgan, A.A., Lu, Z., Wang, X., Cohen, A.M., Fluck, J., Ruch, P., Divoli, A., Fundel, K., Leaman, R. and Hakenberg, J. (2008). Overview of BioCreative II gene normalization. Genome Biology 9(2), 1–19.CrossRef Google Scholar PubMed

Mostafa, M.M. (2013). More than words: Social networks’ text mining for consumer brand sentiments. Expert Systems with Applications 40(10), 4241–4251.CrossRef Google Scholar

Mubarak, H. (2017). Build fast and accurate lemmatization for Arabic. arXiv preprint arXiv:1710.06700.Google Scholar

Mullen, L.A., Benoit, K., Keyes, O., Selivanov, D. and Arnold, J. (2018). Fast, consistent tokenization of natural language text. Journal of Open Source Software 3(23), 655.CrossRef Google Scholar

Munro, R. (2015). Language at ACL this year. ACL stands for the annual conference of the Association for Computational Linguistics. Available from: http://www.junglelightspeed.com/languages-at-acl-this-year/.Google Scholar

Namly, D., Bouzoubaa, K., El Jihad, A. and Aouragh, S.L. (2020). Improving Arabic lemmatization through a lemmas database and a machine-learning technique. In Recent Advances in NLP: The Case of Arabic Language. Springer, pp. 81–100.CrossRef Google Scholar

Narala, S., Rani, B.P. and Ramakrishna, K. (2017). Telugu text categorization using language models. Global Journal of Computer Science and Technology 16(4), 9–13.Google Scholar

Nasir, A., Aslam, K., Tariq, S. and Ullah, M.F. (2020). Predicting mental illness using social media posts and comments. International Journal of Advanced Computer Science and Applications 11(12), 607–613.Google Scholar

Nayak, A.S., Kanive, A., Chandavekar, N. and Balasubramani, R. (2016). Survey on pre-processing techniques for text mining. International Journal of Engineering and Computer Science 5(6) 2319–7242.Google Scholar

Nayel, H.A., Shashirekha, H., Shindo, H. and Matsumoto, Y. (2019). Improving multi-word entity recognition for biomedical texts. arXiv preprint arXiv:1908.05691.Google Scholar

Névéol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C. and Zweigenbaum, P. (2017). CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French. In CLEF (Working Notes). CLEF stands for Conference and Labs of the Evaluation Forum.Google Scholar

Newman, M.E. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46(5), 323–351.CrossRef Google Scholar

Ng, D., Bansal, M. and Curran, J.R. (2015). Web-scale surface and syntactic n-gram features for dependency parsing. arXiv preprint arXiv:1502.07038.Google Scholar

Nguyen, D.Q., Billingsley, R., Du, L. and Johnson, M. (2015). Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3, 299–313.CrossRef Google Scholar

Nicolai, G. and Kondrak, G. (2016). Leveraging inflection tables for stemming and lemmatization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1138–1147.CrossRef Google Scholar

Nivre, J. (2005). Dependency grammar and dependency parsing. MSI Report 5133(1959), 1–32. MSI report is from Växjö University, Växjö, Sweden.Google Scholar

Nothman, J., Qin, H. and Yurchak, R. (2018). Stop word lists in free open-source software packages. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 7–12.CrossRef Google Scholar

Novák, A. and Siklósi, B. (2015). Automatic diacritics restoration for Hungarian. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2286–2291.CrossRef Google Scholar

Nugent, R. (2020). Instead of just teaching data science, let’s understand how and why people do it. Symposium on Data Science and Statistics. Abstract available from: https://ww2.amstat.org/meetings/sdss/2020/onlineprogram/AbstractDetails.cfm?AbstractID=308230.Google Scholar

Nunberg, G. (1990). The Linguistics of Punctuation , vol. 18. Center for the Study of Language (CSLI), Stanford, CA, USA: Stanford University.Google Scholar

Nuzumlal, M.Y. and Özgür, A. (2014). Analyzing stemming approaches for Turkish multi-document summarization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 702–706.CrossRef Google Scholar

Olde, B.A., Hoeffner, J., Chipman, P., Graesser, A.C. and Research Group, Tutoring (1999). A connectionist model for part of speech tagging. In Florida Artificial Intelligence Research Society (FLAIRS) Conference, pp. 172–176.Google Scholar

Paice, C. and Hooper, R. (2005). Lancaster stemmer. Available from: http://www.comp.lancs.ac.uk/computing/research/stemming/[last accessed August 2019].Google Scholar

Pais, V., Ion, R., Avram, A.-M., Mitrofan, M. and Tufis, D. (2021). In-depth evaluation of Romanian natural language processing pipelines. Romanian Journal of Information Science and Technology 24(4), 384–401.Google Scholar

Pak, A. and Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pp. 1320–1326.Google Scholar

Pantel, P. and Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the Twenty-First International Conference on Computational Linguistics and the Forty-Fourth Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 113–120.CrossRef Google Scholar

Papakyriakopoulos, O., Hegelich, S., Serrano, J.C.M. and Marco, F. (2020). Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 446–457.CrossRef Google Scholar

Pareti, S., O’Keefe, T., Konstas, I., Curran, J.R. and Koprinska, I. (2013). Automatically detecting and attributing indirect quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 989–999.Google Scholar

Patil, A., Pharande, K., Nale, D. and Agrawal, R. (2015). Automatic text summarization. International Journal of Computer Applications 109(17), 18–19.CrossRef Google Scholar

Patro, G.K., Chakraborty, A., Ganguly, N. and Gummadi, K.P. (2020). On fair virtual conference scheduling: Achieving equitable participant and speaker satisfaction. arXiv preprint arXiv:2010.14624.Google Scholar

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.Google Scholar

Peitz, S., Freitag, M., Mauser, A. and Ney, H. (2011). Modeling punctuation prediction as machine translation. In International Workshop on Spoken Language Translation (IWSLT).Google Scholar

Peldszus, A. and Stede, M. (2015). Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 938–948. MST stands for minimum spanning tree.CrossRef Google Scholar

Pennell, D. and Liu, Y. (2011). Toward text message normalization: Modeling abbreviation generation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5364–5367.CrossRef Google Scholar

Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.CrossRef Google Scholar

Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Birmingham, UK: Packt Publishing Ltd.Google Scholar

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.Google Scholar

Petic, M. and Gîfu, D. (2014). Transliteration and alignment of parallel texts from Cyrillic to Latin. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 1819–1823.Google Scholar

Piktus, A., Edizel, N.B., Bojanowski, P., Grave, É., Ferreira, R. and Silvestri, F. (2019). Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3226–3234.CrossRef Google Scholar

Poddar, L. (2016). Multilingual multiword expressions. arXiv preprint arXiv:1612.00246.Google Scholar

Polignano, M., Basile, P., De Gemmis, M., Semeraro, G. and Basile, V. (2019). Alberto: Italian BERT language understanding model for NLP challenging tasks based on Tweets. In 6th Italian Conference on Computational Linguistics, CLiC-it 2019, vol. 2481. CEUR Workshop Proceedings, pp. 1–6.Google Scholar

Poria, S., Agarwal, B., Gelbukh, A., Hussain, A. and Howard, N. (2014). Dependency-based semantic parsing for concept-level text analysis. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 113–127.Google Scholar

Porter, M.F. (1980). An algorithm for suffix stripping. Program 14(3), 130–137.CrossRef Google Scholar

Posada, J.D., Barda, A.J., Shi, L., Xue, D., Ruiz, V., Kuan, P.-H., Ryan, N.D. and Tsui, F.R. (2017). Predictive modeling for classification of positive valence system symptom severity from initial psychiatric evaluation records. Journal of Biomedical Informatics 75, S94–S104.CrossRef Google Scholar

Potts, C. (n.d). Sentiment symposium tutorial: Stemming. Available from: http://sentiment.christopherpotts.net/stemming.html [last accessed August 2019].Google Scholar

Preethi, V. (2021). Survey on text transformation using Bi-LSTM in natural language processing with text data. Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12(9), 2577–2585.Google Scholar

Przepiórkowski, A., Piasecki, M., Jassem, K. and Fuglewicz, P. (2012). Computational Linguistics: Applications, vol. 458. Heidelberg, Germany: Springer.Google Scholar

Ptaszynski, M., Lempa, P., Masui, F., Kimura, Y., Rzepka, R., Araki, K., Wroczynski, M. and Leliwa, G. (2019). Brute-force sentence pattern extortion from harmful messages for cyberbullying detection. Journal of the Association for Information Systems 20(8), 4.Google Scholar

Putra, S.J., Gunawan, M.N. and Suryatno, A. (2018). Tokenization and n-gram for indexing Indonesian translation of the Quran. In 2018 6th International Conference on Information and Communication Technology (ICoICT). IEEE, pp. 158–161.CrossRef Google Scholar

Qian, Z., Li, P., Zhu, Q., Zhou, G., Luo, Z. and Luo, W. (2016). Speculation and negation scope detection via convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 815–825.CrossRef Google Scholar

Qudar, M. and Mago, V. (2020). A survey on language models. Technical report, Lakehead University, Thunder Bay, Ontario, Canada.Google Scholar

Raff, E., Fleming, W., Zak, R., Anderson, H., Finlayson, B., Nicholas, C. and McLean, M. (2019). Kilograms: Very large n-grams for malware classification. arXiv preprint arXiv:1908.00200.Google Scholar

Raff, E. and Nicholas, C. (2018). Hash-grams: Faster n-gram features for classification and malware detection. In Proceedings of the ACM Symposium on Document Engineering 2018, pp. 1–4.CrossRef Google Scholar

Rahm, E. and Do, H.H. (2000). Data cleaning: Problems and current approaches. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 23(4), 3–13.Google Scholar

Rahman, R. (2017). Detecting emotion from text and emoticon. London Journal of Research in Computer Science and Technology 17(3), 9–13.Google Scholar

Rajaraman, A. and Ullman, J.D. (2011). Mining of Massive Datasets. Cambridge, UK: Cambridge University Press.CrossRef Google Scholar

Rajput, B.S. and Khare, N. (2015). A survey of stemming algorithms for information retrieval. International Organization of Scientific Research – Journal of Computer Engineering 17(3), 76–80.Google Scholar

Raulji, J.K. and Saini, J.R. (2016). Stop-word removal algorithm and its implementation for Sanskrit language. International Journal of Computer Applications 150(2), 15–17.Google Scholar

Reber, U. (2019). Overcoming language barriers: Assessing the potential of machine translation and topic modeling for the comparative analysis of multilingual text corpora. Communication Methods and Measures 13(2), 102–125.CrossRef Google Scholar

Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. LREC stands for Conference on Language Resources and Evaluation.Google Scholar

Reynar, J.C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics, pp. 16–19.CrossRef Google Scholar

Richardson, L. (2020). Beautiful Soup 4.9.1. Python library. Available from: https://www.crummy.com/software/BeautifulSoup/.Google Scholar

Rikters, M. and Bojar, O. (2017). Paying attention to multi-word expressions in neural machine translation. arXiv preprint arXiv:1710.06313.Google Scholar

Rinker, T.W. (2018a). lexicon: Lexicon Data.

$\mathsf{R}$ package version 1.2.1. Available from: http://github.com/trinker/lexicon.Google Scholar

Rinker, T.W. (2018b). textclean: Text Cleaning Tools.

$\mathsf{R}$ package version 0.9.3. Available from: https://github.com/trinker/textclean.Google Scholar

Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B. and Rand, D.G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science 58(4), 1064–1082.CrossRef Google Scholar

Robinson, D. (2018). gutenbergr: Download and Process Public Domain Works from Project Gutenberg.

$\mathsf{R}$ package version 0.1.4. Available from: https://CRAN.R-project.org/package=gutenbergr.Google Scholar

Rosenthal, S. and McKeown, K. (2013). Columbia NLP: Sentiment detection of subjective phrases in social media. In Second Joint Conference on Lexical and Computational Semantics (SEM): Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 478–482.Google Scholar

Rosid, M.A., Fitrani, A.S., Astutik, I.R.I., Mulloh, N.I. and Gozali, H.A. (2020). Improving text preprocessing for student complaint document classification using Sastrawi. In IOP Conference Series: Materials Science and Engineering, vol. 874, 012017, Bristol, UK: IOP Publishing.CrossRef Google Scholar

Saal, F.E., Downey, R.G. and Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin 88(2), 413.CrossRef Google Scholar

Sadvilkar, N. and Neumann, M. (2020). PySBD: Pragmatic sentence boundary disambiguation. arXiv preprint arXiv:2010.09657.Google Scholar

Sag, I.A., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 1–15.Google Scholar

Sailer, M. and Markantonatou, S. (2018). Multiword Expressions: Insights from a Multi-Lingual Perspective. Berlin, Germany: Language Science Press.Google Scholar

Samad, M.D., Khounviengxay, N.D. and Witherow, M.A. (2020). Effect of text processing steps on Twitter sentiment classification using word embedding. arXiv preprint arXiv:2007.13027.Google Scholar

Samir, A. and Lahbib, Z. (2018). Stemming and lemmatization for information retrieval systems in Amazigh language. In International Conference on Big Data, Cloud and Applications. Springer, pp. 222–233.CrossRef Google Scholar

Sánchez-Vega, F., Villatoro-Tello, E., Montes-y Gómez, M., Rosso, P., Stamatatos, E. and Villaseñor-Pineda, L. (2019). Paraphrase plagiarism identification with character-level features. Pattern Analysis and Applications 22(2), 669–681.CrossRef Google Scholar

Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6(12), e26752. Available from: https://catalog.ldc.upenn.edu/LDC2008T19.Google Scholar

Sarica, S. and Luo, J. (2020). Stopwords in technical language processing. arXiv preprint arXiv:2006.02633.Google Scholar

Sarkar, D. (2019). Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing. Bangalore, India: Apress.CrossRef Google Scholar

Schofield, A., Magnusson, M. and Mimno, D. (2017). Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the Fifteenth Conference of the European Chapter of the Association for Computational Linguistics, vol. 2, pp. 432–436.CrossRef Google Scholar

Schuman, H. and Presser, S. (1996). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Thousand Oaks, CA, USA: Sage.Google Scholar

Schuur, Y. (2020). Normalization for Dutch for Improved POS Tagging. Master’s Thesis, University of Groningen, Groningen, Netherlands.Google Scholar

Şeker, G.A. and Eryiğit, G. (2012). Initial explorations on using CRFs for Turkish named entity recognition. In Proceedings of COLING 2012, pp. 2459–2474. CRFs stand for conditional random fields, and COLING stands for International Conference on Computational Linguistics.Google Scholar

Seretan, V. (2011). Syntax-Based Collocation Extraction, vol. 44. Heidelberg, Germany: Springer Science & Business Media.CrossRef Google Scholar

Shaikh, A., More, D., Puttoo, R., Shrivastav, S. and Shinde, S. (2019). A survey paper on chatbots. International Research Journal of Engineering and Technology (IRJET) 6(4), 1786–1789.Google Scholar

Shalaby, W., Zadrozny, W. and Jin, H. (2019). Beyond word embeddings: Learning entity and concept representations from large scale knowledge bases. Information Retrieval Journal 22(6), 525–542.CrossRef Google Scholar

Silge, J. and Robinson, D. (2017). Text Mining with

$\mathsf{R}$ : A Tidy Approach. Sebastopol, CA, USA: O’Reilly Media Inc.Google Scholar

Sinclair, J. (2018). Collins English Dictionary, 13th Edn. New York NY, USA: HarperCollins.Google Scholar

Singh, V. and Saini, B. (2014). An effective tokenization algorithm for information retrieval systems. In First International Conference on Data Mining, pp. 109–119.CrossRef Google Scholar

Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24(4), 35–43.Google Scholar

Slioussar, N. (2011). Processing of a free word order language: The role of syntax and context. Journal of Psycholinguistic Research 40(4), 291–306.CrossRef Google Scholar PubMed

Søgaard, A., de Lhoneux, M. and Augenstein, I. (2018). Nightmare at test time: How punctuation prevents parsers from generalizing. arXiv preprint arXiv:1809.00070.Google Scholar

Soriano, J., Au, T. and Banks, D. (2013). Text mining in computational advertising. Statistical Analysis and Data Mining: The ASA Data Science Journal 6(4), 273–285. ASA stands for American Statistical Association.CrossRef Google Scholar

Soricut, R. and Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 228–235.CrossRef Google Scholar

Stamatatos, E. (2008). Author identification: Using text sampling to handle the class imbalance problem. Information Processing & Management 44(2), 790–799.CrossRef Google Scholar

Stamatatos, E. (2011). Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology 62(12), 2512–2527.CrossRef Google Scholar

Sweeney, K. (2020). Unsupervised Machine Learning for Conference Scheduling: A Natural Language Processing Approach based on Latent Dirichlet Allocation. Master’s Thesis, NHH Norwegian School of Economics, Bergen, Norway.Google Scholar

Tabii, Y., Lazaar, M., Al Achhab, M. and Enneya, N. (2018). Big Data, Cloud and Applications (BDCA): Third International Conference, BDCA 2018, Kenitra, Morocco, 4–5 April 2018, Revised Selected Papers, vol, 872. Springer.CrossRef Google Scholar

Taboada, M., Meizoso, M., Martínez, D., Riano, D. and Alonso, A. (2013). Combining open-source natural language processing tools to parse clinical practice guidelines. Expert Systems 30(1), 3–11.CrossRef Google Scholar

Tan, F., Hu, Y., Hu, C., Li, K. and Yen, K. (2020). TNT: Text normalization based pre-training of transformers for content moderation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4735–4741.CrossRef Google Scholar

Tan, L. and Pal, S. (2014). Manawi: Using multi-word expressions and named entities to improve machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 201–206.CrossRef Google Scholar

Tang, C., Ross, K., Saxena, N. and Chen, R. (2011). What’s in a name: A study of names, gender inference, and gender behavior in Facebook. In International Conference on Database Systems for Advanced Applications. Springer, pp. 344–356.CrossRef Google Scholar

Tang, J., Hong, M., Zhang, D.L. and Li, J. (2008). Information extraction: Methodologies and applications. In Emerging Technologies of Text Mining: Techniques and Applications. USA: IGI Global, pp. 1–33.CrossRef Google Scholar

Temnikova, I., Nikolova, I., Baumgartner, W.A. Jr, Angelova, G. and Cohen, K.B. (2013). Closure properties of Bulgarian clinical text. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pp. 667–675.Google Scholar

Thanaki, J. (2017). Python Natural Language Processing. Birmingham, UK: Packt Publishing Ltd.Google Scholar

Thione, G. and van den Berg, M. (2007). Systems and methods for structural indexing of natural language text. Google Patents. US Patent Application No. 11/405,385.Google Scholar

Toral, A., Pecina, P., Wang, L. and Van Genabith, J. (2015). Linguistically-augmented perplexity-based data selection for language models. Computer Speech & Language 32(1), 11–26.CrossRef Google Scholar

Torres-Moreno, J.-M. (2012). Beyond stemming and lemmatization: Ultra-stemming to improve automatic text summarization. arXiv preprint arXiv:1209.3126.Google Scholar

Torres-Moreno, J.-M. (2014). Automatic Text Summarization. Hoboken, NJ, USA: John Wiley & Sons.CrossRef Google Scholar

Tran, D. and Sharma, D. (2005). Markov models for written language identification. In Proceedings of the 12th International Conference on Neural Information Processing, pp. 67–70.Google Scholar

Trieschnigg, D., Kraaij, W. and de Jong, F. (2007). The influence of basic tokenization on biomedical document retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 803–804.CrossRef Google Scholar

Truss, L. (2004). Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation. New York, NY, USA: Penguin.Google Scholar

Truss, L. (2006). Eats, Shoots & Leaves: Why, Commas Really do make a Difference! New York, NY, USA: Penguin.Google Scholar

Tsarfaty, R., Seddah, D., Goldberg, Y., Kübler, S., Candito, M., Foster, J., Versley, Y., Rehbein, I. and Tounsi, L. (2010). Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL-HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages. Association for Computational Linguistics, pp. 1–12. NAACL-HLT stands for the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Google Scholar

Venhuizen, N.J., Bos, J., Hendriks, P. and Brouwer, H. (2018). Discourse semantics with information structure. Journal of Semantics 35(1), 127–169.CrossRef Google Scholar

Verma, R.M. and Marchette, D.J. (2019). Cybersecurity Analytics. New York, NY, USA: CRC Press.CrossRef Google Scholar

Verspoor, C.M., Joslyn, C. and Papcun, G.J. (2003). The gene ontology as a source of lexical semantic knowledge for a biological natural language processing application. In SIGIR Workshop on Text Analysis and Search for Bioinformatics, pp. 51–56.Google Scholar

Verspoor, K., Dvorkin, D., Cohen, K.B. and Hunter, L. (2009). Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12), i77–i84.CrossRef Google Scholar PubMed

Vieweg, S., Hughes, A.L., Starbird, K. and Palen, L. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, pp. 1079–1088. SIGCHI stands for Special Interest Group on Computer-Human Interaction.CrossRef Google Scholar

Vijayarani, S., Ilamathi, J. and Nithya, V. (2015). Preprocessing techniques for text mining: An overview. International Journal of Computer Science & Communication Networks 5(1), 7–16.Google Scholar

Vilares, J., Vilares, M., Alonso, M.A. and Oakes, M.P. (2016). On the feasibility of character n-grams pseudo-translation for cross-language information retrieval tasks. Computer Speech & Language 36, 136–164.CrossRef Google Scholar

Villares-Maldonado R. (2017). A corpus-based analysis of non-standard English features in the microblogging platform Tumblr. EPiC Series in Language and Linguistics 2, 295–303.CrossRef Google Scholar

Wallach, H.M. (2006). Topic modeling: Beyond bag-of-words. In Proceedings of the Twenty-Third International Conference on Machine Learning. ACM, pp. 977–984.CrossRef Google Scholar

Wang, H., Liu, L., Song, W. and Lu, J. (2014a). Feature-based sentiment analysis approach for product reviews. Journal of Software 9(2), 274–279.Google Scholar

Wang, W., Gu, Z. and Tang, K. (2020). An improved algorithm for Bert. In Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies, pp. 116–120.CrossRef Google Scholar

Wang, X., McCallum, A. and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, pp. 697–702.CrossRef Google Scholar

Wang, X., Tokarchuk, L. and Poslad, S. (2014b). Identifying relevant event content for real-time event detection. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 395–398.CrossRef Google Scholar

Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., Kingsbury, P. and Liu, H. (2018). A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics 87, 12–20.CrossRef Google Scholar PubMed

Webster, J.J. and Kit, C. (1992). Tokenization as the initial phase in NLP. In Proceedings of the 14th Conference on Computational Linguistics, pp. 1106–1110.CrossRef Google Scholar

Wei, Z., Miao, D., Chauchat, J.-H., Zhao, R. and Li, W. (2009). N-grams based feature selection and text representation for Chinese text classification. International Journal of Computational Intelligence Systems 2(4), 365–374.Google Scholar

Welbers, K., Van Atteveldt, W. and Benoit, K. (2017). Text analysis in

$\mathsf{R}$ . Communication Methods and Measures 11(4), 245–265.CrossRef Google Scholar

Wermter, J. and Hahn, U. (2005). Paradigmatic modifiability statistics for the extraction of complex multi-word terms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 843–850.CrossRef Google Scholar

White, L., Togneri, R., Liu, W. and Bennamoun, M. (2018). Neural Representations of Natural Language, vol. 783. Singapore: Springer.Google Scholar

White, R.W. and Morris, D. (2007). Investigating the querying and browsing behavior of advanced search engine users. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 255–262.CrossRef Google Scholar

Widyassari, A.P., Rustad, S., Shidik, G.F., Noersasongko, E., Syukur, A. and Affandy, A. (2022). Review of automatic text summarization techniques & methods. Journal of King Saud University-Computer and Information Sciences 34(4), 1029–1046.CrossRef Google Scholar

Wijffels, J. (2021). udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit.

$\mathsf{R}$ package version 0.8.8. Available from: https://cran.r-project.org/web/packages/udpipe/udpipe.pdf.Google Scholar

Wilson, K.S. (2009). Database search control. US Patent Application No. 12/031,701.Google Scholar

Wong, D.F., Chao, L.S. and Zeng, X. (2014). iSentenizer-

$\mu$ : Multilingual sentence boundary detection model. The Scientific World Journal 2014, 1–10.Google Scholar

Woo, H., Kim, J. and Lee, W. (2020). Validation of text data preprocessing using a neural network model. Mathematical Problems in Engineering 2020, 1–9.CrossRef Google Scholar

Xue, J., Chen, J., Hu, R., Chen, C., Zheng, C. and Zhu, T. (2020). Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach. arXiv preprint arXiv: 2005.12830.Google Scholar

Yamada, I., Takeda, H. and Takefuji, Y. (2015). Enhancing named entity recognition in Twitter messages using entity linking. In Proceedings of the Workshop on Noisy User-Generated Text, pp. 136–140.CrossRef Google Scholar

Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science 47(5), 357–369.3.0.CO;2-V>CrossRef Google Scholar

Yazdani, A., Ghazisaeedi, M., Ahmadinejad, N., Giti, M., Amjadi, H. and Nahvijou, A. (2020). Automated misspelling detection and correction in Persian clinical text. Journal of Digital Imaging 33(3), 555–562.

Ye, X., Shen, H., Ma, X., Bunescu, R. and Liu, C. (2016). From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, pp. 404–415.

Yeh, K.C. (2003). Bilingual sentence alignment based on punctuation marks. In Proceedings of the ROCLING 2003 Student Workshop, pp. 303–312. ROCLING stands for Conference on Computational Linguistics and Speech Processing.

Yin, D., Ren, F., Jiang, P. and Kuroiwa, S. (2007). Chinese complex long sentences processing method for Chinese-Japanese machine translation. In 2007 International Conference on Natural Language Processing and Knowledge Engineering. IEEE, pp. 170–175.

Yolchuyeva, S., Németh, G. and Gyires-Tóth, B. (2018). Text normalization with convolutional neural networks. International Journal of Speech Technology 21(3), 589–600.

Yuhang, Z., Yue, W. and Wei, Y. (2010). Research on data cleaning in text clustering. In 2010 International Forum on Information Technology and Applications, vol. 1. IEEE, pp. 305–307.

Zaman, A., Matsakis, P. and Brown, C. (2011). Evaluation of stop word lists in text retrieval using latent semantic indexing. In 2011 Sixth International Conference on Digital Information Management. IEEE, pp. 133–136.

Zeng, C., Li, S., Li, Q., Hu, J. and Hu, J. (2020). A survey on machine reading comprehension: tasks, evaluation metrics and benchmark datasets. Applied Sciences 10(21), 7640.

Zeng, Z., Xu, H., Chong, T.Y., Chng, E.-S. and Li, H. (2017). Improving N-gram language modeling for code-switching speech recognition. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, pp. 1596–1601.

Zhang, C., Baldwin, T., Ho, H., Kimelfeld, B. and Li, Y. (2013). Adaptive parser-centric text normalization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1159–1168.

Zhang, H. (2008). Exploring regularity in source code: Software science and Zipf’s law. In 2008 15th Working Conference on Reverse Engineering. IEEE, pp. 101–110.

Zhang, J. and El-Gohary, N.M. (2017). Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking. Automation in Construction 73, 45–57.

Zhang, L. and Komachi, M. (2018). Neural machine translation of logographic languages using sub-character level information. arXiv preprint arXiv:1809.02694.

Zhang, X. and LeCun, Y. (2017). Which encoding is the best for text classification in Chinese, English, Japanese and Korean? arXiv preprint arXiv:1708.02657.

Zhang, Y., Chen, Q., Yang, Z., Lin, H. and Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data 6(1), 1–9.

Zhou, L., Siddiqui, T., Seliger, S.L., Blumenthal, J.B., Kang, Y., Doerfler, R. and Fink, J.C. (2019). Text preprocessing for improving hypoglycemia detection from clinical notes – A case study of patients with diabetes. International Journal of Medical Informatics 129, 374–380.

Zhou, M., Duan, N., Liu, S. and Shum, H.-Y. (2020). Progress in neural NLP: Modeling, learning, and reasoning. Engineering 6(3), 275–290.

Zipf, G.K. (1949). Human behavior and the principle of least effort.

Zubiaga, A., Spina, D., Martínez, R. and Fresno, V. (2015). Real-time classification of Twitter trends. Journal of the Association for Information Science and Technology 66(3), 462–473.

Zupon, A., Crew, E. and Ritchie, S. (2021). Text normalization for low-resource languages of Africa. arXiv preprint arXiv:2103.15845.

Zweigenbaum, P. and Grabar, N. (2002a). Accenting unknown words: Application to the French version of the MeSH. In Workshop NLP in Biomedical Applications, EFMI, Cyprus, 69á. EFMI stands for European Federation for Medical Informatics.

Zweigenbaum, P. and Grabar, N. (2002b). Restoring accents in unknown biomedical words: Application to the French MeSH thesaurus. International Journal of Medical Informatics 67(1–3), 113–126.
