A clustering framework for lexical normalization of Roman Urdu

Published online by Cambridge University Press:  10 June 2020

Abdul Rafae Khan*
Affiliation:
Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
Asim Karim
Affiliation:
Lahore University of Management Sciences, D.H.A, Lahore Cantt., 54792 Lahore, Pakistan
Hassan Sajjad
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Faisal Kamiran
Affiliation:
Information Technology University, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan
Jia Xu
Affiliation:
Stevens Institute of Technology, Hoboken, NJ 07030, USA; The Graduate Center, Computer Science Department, City University of New York, 365 5th Ave, New York, NY 10016, USA
*Corresponding author. E-mail: rafae015@gmail.com

Abstract

Roman Urdu is an informal form of the Urdu language written in Roman script and is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges for automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu in Roman script. The similarity function incorporates phonetic, string-based, and contextual features of words. Lex-Var is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It uses a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over their respective alternative algorithms within our clustering framework for the lexical normalization of Roman Urdu.
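To make the pipeline described in the abstract more concrete, the following is a minimal sketch of threshold-based medoid clustering over a weighted word similarity. It is not the authors' implementation: `toy_phonetic_key` is a hypothetical stand-in for a phonetic encoder (it is not UrduPhone), the string feature uses Python's `difflib` rather than the article's string matching component, contextual features are omitted, and the weights and the threshold value are illustrative assumptions.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowels, and collapse repeated letters."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phon: float = 0.5, w_str: float = 0.5) -> float:
    """Weighted sum of a phonetic feature and a string feature.
    The weights are assumed values, not taken from the article."""
    phon = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phon * phon + w_str * string


def cluster_variants(words, threshold: float = 0.75, max_iter: int = 10):
    """k-medoids-style clustering in which a word opens a new cluster
    when no current medoid is at least `threshold` similar to it."""
    words = list(dict.fromkeys(words))  # de-duplicate, keep input order
    if not words:
        return []
    medoids = [words[0]]
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach each word to its most similar medoid,
        # or start a new cluster if the best similarity is too low.
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m))
            if similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                clusters[w] = [w]
        # Update step: the new medoid of a cluster is the member with
        # the highest total similarity to the cluster's members.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [members for members in clusters.values() if members]


if __name__ == "__main__":
    sample = ["zindagi", "zindgi", "zindagee", "kitab", "kitaab", "ketab"]
    for group in cluster_variants(sample):
        print(group)
```

In this sketch, raising `threshold` yields more, tighter clusters, while lowering it merges more spelling variants into fewer clusters, which mirrors the trade-off the abstract attributes to the similarity threshold.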

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
