
Multilingual SMS-based author profiling: Data and methods

Abstract

In recent years, many benchmark author profiling corpora have been developed for various genres, including Twitter, social media, blogs, hotel reviews and e-mail. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication that is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles. Each profile aggregates an author's SMS messages into a single document and is accompanied by seven demographic traits of that author: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study are (1) to annotate the proposed corpus for code-switching at the lexical level (approximately 0.69 million tokens are manually annotated) and (2) to apply the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) to gender identification, in order to demonstrate how the proposed corpus can be used for the development and evaluation of author profiling methods. The results show that the content-based character 5-gram feature outperformed all other features, achieving an accuracy of 0.975 and an F1 score of 0.947 for gender identification on the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.
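To make the best-performing feature concrete, the following is a minimal sketch of how character 5-gram counts could be extracted from one author profile. It is an illustration only, not the authors' pipeline: the function name `char_ngrams`, the lowercasing and whitespace normalisation, and the toy profile text are all assumptions introduced here.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in one author's aggregated messages."""
    # Assumption: lowercase and collapse whitespace runs so that casing and
    # spacing differences do not create spurious n-grams.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Hypothetical toy profile: a few SMS messages merged into a single document.
profile = "kal milte hain. ok see you tomorrow"
features = char_ngrams(profile, n=5)
print(features.most_common(3))
```

In a setup like the one the abstract describes, count vectors of this kind (one per author profile) would typically be passed to a standard classifier and scored with accuracy and F1. Which classifier and weighting scheme were used is not stated in the abstract.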

