Multilingual SMS-based author profiling: Data and methods

MEHWISH FATIMA; SABA ANWAR; AMNA NAVEED; WAQAS ARSHAD; RAO MUHAMMAD ADEEL NAWAB; MUNTAHA IQBAL; ALIA MASOOD

doi:10.1017/S1351324918000244

Published online by Cambridge University Press: 26 June 2018

Affiliations:
MEHWISH FATIMA, SABA ANWAR, AMNA NAVEED, RAO MUHAMMAD ADEEL NAWAB, ALIA MASOOD: Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan. E-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
WAQAS ARSHAD: Department of Computer Science & IT, Superior University, Lahore, Pakistan. E-mail: waqas.arshad@superior.edu.com.pk
MUNTAHA IQBAL: Al-Khwarizmi Institute of Computer Science, University of Engineering & Technology, Lahore, Pakistan. E-mail: muntaha.iqbal@kics.edu.pk

Abstract

In recent years, many benchmark author profiling corpora have been developed for various genres, including Twitter, social media, blogs, hotel reviews and e-mail. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication and a valuable source of data for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles. Each profile aggregates an author's SMS messages into a single document and is associated with seven demographic traits: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study are (1) to annotate the proposed corpus for code-switching at the lexical level (approximately 0.69 million tokens are manually annotated for code-switching) and (2) to apply the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) to gender identification, in order to demonstrate how the proposed corpus can be used for the development and evaluation of various author profiling methods. The results show that the content-based character 5-gram feature outperformed all other features, obtaining an accuracy of 0.975 and an F1 score of 0.947 for gender identification on the entire corpus. Furthermore, the proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.
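As a rough illustration of the content-based character 5-gram feature and the two reported measures, the following Python sketch counts overlapping character 5-grams in a profile and computes accuracy and F1 for a set of predictions. It is not the authors' implementation: the function names `char_ngrams` and `accuracy_and_f1`, the toy profile text, and the choice to compute F1 for a single class treated as positive are all assumptions made for this example, and the abstract does not state how the F1 score was averaged.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Count overlapping character n-grams in a string (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy_and_f1(gold, predicted, positive):
    """Return (accuracy, F1) with one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy profile mixing English and Roman Urdu, invented for illustration only.
profile = "kal milte hain, see you soon"
features = char_ngrams(profile)

# Toy gold labels and predictions, also invented for illustration.
gold = ["female", "male", "female", "male"]
predicted = ["female", "male", "male", "male"]
accuracy, f1 = accuracy_and_f1(gold, predicted, positive="female")
```

In a full pipeline, the n-gram counts for each of the author profiles would be turned into feature vectors and passed to a classifier; that step is omitted here because the abstract does not name the classifier used.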

Type: Article
Information: Natural Language Engineering, Volume 24, Issue 5, September 2018, pp. 695–724

DOI: https://doi.org/10.1017/S1351324918000244
Copyright: © Cambridge University Press 2018
