A clustering framework for lexical normalization of Roman Urdu

Abdul Rafae Khan; Asim Karim; Hassan Sajjad; Faisal Kamiran; Jia Xu

doi:10.1017/S1351324920000285

Published online by Cambridge University Press: 10 June 2020

Abdul Rafae Khan*: Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
Asim Karim: Lahore University of Management Sciences, D.H.A, Lahore Cantt., 54792 Lahore, Pakistan
Hassan Sajjad: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Faisal Kamiran: Information Technology University, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan
Jia Xu: Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
*Corresponding author. E-mail: rafae015@gmail.com

Abstract

Roman Urdu is an informal form of the Urdu language written in Roman script and is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges for automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework comprises a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). UrduPhone encodes Roman Urdu strings into pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu in Roman script. The similarity function incorporates phonetic, string-based, and contextual features of words. Lex-Var is a variant of the k-medoids clustering algorithm that groups lexical variations of words; it uses a similarity threshold to balance the number of clusters against their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate that UrduPhone and Lex-Var outperform the respective alternative algorithms within our clustering framework for the lexical normalization of Roman Urdu.
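
The pipeline described in the abstract (phonetic encoding, a weighted similarity function, and threshold-based medoid clustering) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the consonant table in `_GROUPS` is a toy Soundex-style mapping and not the actual UrduPhone encoding, the weights `w_phon` and `w_str` and the `threshold` value are arbitrary, contextual features are omitted, and `threshold_medoid_clustering` is a simplified stand-in rather than the published Lex-Var algorithm.

```python
from difflib import SequenceMatcher

# Toy consonant groups (hypothetical; NOT the UrduPhone table from the article).
_GROUPS = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2", "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}


def toy_phonetic_code(word, length=6):
    """Soundex-style code: keep the first letter, map consonants to digits,
    skip unmapped letters (vowels, h, y) and collapse adjacent repeats."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0]
    prev = _GROUPS.get(word[0], "")
    for ch in word[1:]:
        digit = _GROUPS.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return code[:length].ljust(length, "0")


def similarity(a, b, w_phon=0.5, w_str=0.5):
    """Weighted sum of a phonetic-match feature and a string-overlap feature.
    The weights are illustrative assumptions, not learned values."""
    phon = 1.0 if toy_phonetic_code(a) == toy_phonetic_code(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phon * phon + w_str * string


def threshold_medoid_clustering(words, threshold=0.7, max_iter=10):
    """Simplified medoid clustering: a word joins its most similar medoid if
    the similarity reaches the threshold, otherwise it starts a new cluster."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(medoids, key=lambda m: similarity(w, m), default=None)
            if best is not None and similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                medoids.append(w)
                clusters[w] = [w]
        # Update step: the new medoid is the member most similar to the rest.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return list(clusters.values())
```

Calling `threshold_medoid_clustering(["zindagi", "zindgi", "zindagee", "mohabbat", "muhabat", "mohabat"])` groups the spelling variants of each word together under these toy settings. A higher `threshold` produces more, tighter clusters, which is the trade-off the abstract attributes to the similarity threshold.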

Keywords

Text data mining; Similarity; Machine learning; Phonetic encoding

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 1, January 2022, pp. 93–123

DOI: https://doi.org/10.1017/S1351324920000285
Copyright: © The Author(s), 2020. Published by Cambridge University Press

