Creation of annotated country-level dialectal Arabic resources: An unsupervised approach

Maha J. Althobaiti

doi:10.1017/S135132492100019X

Published online by Cambridge University Press: 09 August 2021

Maha J. Althobaiti*
Affiliation: Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
*: Corresponding author. E-mail: maha.j@tu.edu.sa

Abstract

The wide usage of multiple spoken Arabic dialects on social networking sites stimulates increasing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects represent true linguistic diversity and differ from Modern Standard Arabic (MSA). In fact, the complexity and variety of these dialects make it insufficient to build one NLP system that is suitable for all of them. In comparison with MSA, the available datasets for various dialects are generally limited in terms of size, genre and scope. In this article, we present a novel approach that automatically develops an annotated country-level dialectal Arabic corpus and builds lists of words that encompass 15 Arabic dialects. The algorithm uses an iterative procedure consisting of two main components: automatic creation of lists of dialectal words and automatic creation of an annotated Arabic dialect identification corpus. To our knowledge, our study is the first of its kind to examine and analyse the poor performance of the MSA part-of-speech tagger on dialectal Arabic content and to exploit that performance in order to extract dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, is collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1,125 tweets in total, and asked native speakers to carry out a manual dialect identification task. The results show an average inter-annotator agreement score of 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
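As an illustration of the pointwise mutual information (PMI) association measure named in the abstract, the following is a minimal sketch of how a word could be scored against a country from token counts. It uses the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))). It is not the article's implementation: the function name `pmi_by_country`, the input layout, and the toy tokens and country codes are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_by_country(tweets):
    """Score every (word, country) pair with pointwise mutual information.

    `tweets` is an iterable of (country, tokens) pairs; this layout is an
    assumption for the sketch, not the format used in the article.
    """
    joint = Counter()          # count of each (word, country) pair
    word_total = Counter()     # count of each word across all countries
    country_total = Counter()  # count of tokens seen per country
    n = 0                      # total number of tokens

    for country, tokens in tweets:
        for tok in tokens:
            joint[(tok, country)] += 1
            word_total[tok] += 1
            country_total[country] += 1
            n += 1

    # PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
    #           = log2( joint * n / (count(w) * count(c)) )
    return {
        (w, c): math.log2((k * n) / (word_total[w] * country_total[c]))
        for (w, c), k in joint.items()
    }


# Toy, hypothetical data: two countries, romanised placeholder tokens.
sample = [
    ("EG", ["ezayak", "ana"]),
    ("SA", ["wesh", "ana"]),
]
scores = pmi_by_country(sample)
```

In this toy sample, a word that appears in only one country receives a positive score for that country, while a word spread evenly across both receives a score of zero. A real pipeline would need frequency thresholds, since PMI is known to overrate rare words.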

Keywords

Arabic dialect identification (ADI); Unsupervised approach; Pointwise mutual information (PMI); Automatic corpus construction; Text mining

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 5, September 2022, pp. 607–648

DOI: https://doi.org/10.1017/S135132492100019X
Copyright: © The Author(s), 2021. Published by Cambridge University Press
