SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

Neelakshi Sarma; Ranbir Sanasam Singh; Diganta Goswami

doi:10.1017/S1351324921000115


Published online by Cambridge University Press: 03 June 2021


Neelakshi Sarma*: Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Ranbir Sanasam Singh: Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Diganta Goswami: Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
*Corresponding author. E-mail: s.neelakshi@iitg.ac.in


Abstract

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification report two important observations. First, the local context is an important indicator of the language of a word when the word is valid in multiple languages. Second, considering a word in isolation from its context leads to more effective language classification when the word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that uses a dynamic switching mechanism to classify effectively both words that are borrowed or embedded from other languages and words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction towards either the prediction obtained from the contextual information or the one obtained from the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extensible to new languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, and an ensemble of the two classifiers.
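The switching idea described in the abstract can be illustrated with a small sketch. This is not the authors' architecture, which this page does not describe; it only assumes, for illustration, that the switch is a scalar gate in [0, 1] that forms a convex combination of two per-language probability distributions. The names `switch_predict`, `predict_language`, and `gate`, as well as the example probabilities, are hypothetical.

```python
from typing import Dict


def switch_predict(p_context: Dict[str, float],
                   p_word: Dict[str, float],
                   gate: float) -> Dict[str, float]:
    """Blend two language distributions with a scalar gate in [0, 1].

    gate close to 1 biases the result towards the context-based prediction;
    gate close to 0 biases it towards the word-in-isolation prediction.
    (Assumed form of the switch; the abstract does not specify it.)
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    languages = set(p_context) | set(p_word)
    return {
        lang: gate * p_context.get(lang, 0.0)
        + (1.0 - gate) * p_word.get(lang, 0.0)
        for lang in languages
    }


def predict_language(p_context: Dict[str, float],
                     p_word: Dict[str, float],
                     gate: float) -> str:
    """Return the language with the highest blended probability."""
    blended = switch_predict(p_context, p_word, gate)
    return max(blended, key=blended.get)


# Hypothetical distributions for one token from two separate classifiers.
p_context = {"en": 0.2, "hi": 0.8}  # classifier that sees neighbouring words
p_word = {"en": 0.7, "hi": 0.3}     # classifier that sees only the word

# A high gate trusts the context; a low gate trusts the word in isolation.
label_context_biased = predict_language(p_context, p_word, gate=0.9)
label_word_biased = predict_language(p_context, p_word, gate=0.1)
```

In a learned model the gate would itself be predicted from the input rather than fixed by hand; here it is passed in explicitly to keep the sketch self-contained. Fixing the gate at 0.5 for every input reduces the blend to a plain average of the two classifiers.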

Keywords

Language identification; Code-mixing; Social media text; Multilingual

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 3, May 2022, pp. 337–359

DOI: https://doi.org/10.1017/S1351324921000115
Copyright: © The Author(s), 2021. Published by Cambridge University Press


