Term evaluation metrics in imbalanced text categorization

Behzad Naderalvojoud; Ebru Akcapinar Sezer

doi:10.1017/S1351324919000317

Published online by Cambridge University Press: 12 July 2019

Affiliations:
Behzad Naderalvojoud*: Department of Computer Engineering, Hacettepe University, 06800, Ankara, Turkey
Ebru Akcapinar Sezer: Department of Computer Engineering, Hacettepe University, 06800, Ankara, Turkey
*Corresponding author. Emails: n.behzad@hacettepe.edu.tr, ebru@hacettepe.edu.tr

Abstract

This paper proposes four novel term evaluation metrics for representing documents in text categorization where the class distribution is imbalanced. These metrics are derived by revising four common term evaluation metrics: chi-square, information gain, odds ratio, and relevance frequency. Whereas the common metrics require a balanced class distribution, the proposed metrics evaluate document terms under an imbalanced distribution: they calculate the degree of relatedness of each term to the minor and major classes while taking the imbalance between those classes into account. Using these metrics in document representation yields a better distinction between documents of the minor and major classes and improves the performance of machine learning algorithms. The proposed metrics are assessed on three popular benchmarks (two subsets of Reuters-21578 and WebKB) using four classification algorithms: support vector machines, naive Bayes, decision trees, and centroid-based classifiers. Our empirical results indicate that the proposed metrics outperform the common metrics in imbalanced text categorization.
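For orientation, the four common metrics named in the abstract can be sketched from a 2x2 term/class contingency table. The block below uses widely cited textbook forms of chi-square, information gain, odds ratio, and relevance frequency; it does not implement the revised metrics proposed in the paper, which the abstract does not define. The count names `a`, `b`, `c`, `d`, the function names, and the choice of logarithm bases are illustrative assumptions.

```python
import math

# Assumed notation for one term and one (positive / minor) class:
#   a = positive-class documents containing the term
#   b = positive-class documents NOT containing the term
#   c = negative-class documents containing the term
#   d = negative-class documents NOT containing the term


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 term/class table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def information_gain(a, b, c, d):
    """Mutual information (in bits) between term presence and class."""
    n = a + b + c + d
    cells = (
        (a, a + c, a + b),  # term present, positive class
        (b, b + d, a + b),  # term absent, positive class
        (c, a + c, c + d),  # term present, negative class
        (d, b + d, c + d),  # term absent, negative class
    )
    ig = 0.0
    for joint, term_total, class_total in cells:
        if joint > 0:  # empty cells contribute nothing
            ig += (joint / n) * math.log2(joint * n / (term_total * class_total))
    return ig


def odds_ratio(a, b, c, d):
    """Log odds ratio (natural log); assumes all four counts are non-zero."""
    return math.log((a * d) / (b * c))


def relevance_frequency(a, c):
    """Relevance frequency in its usual form: log2(2 + a / max(1, c))."""
    return math.log2(2 + a / max(1, c))


if __name__ == "__main__":
    # Hypothetical counts: 40 minor-class and 160 major-class documents.
    a, b, c, d = 30, 10, 20, 140
    print("chi-square:", chi_square(a, b, c, d))
    print("information gain:", information_gain(a, b, c, d))
    print("odds ratio:", odds_ratio(a, b, c, d))
    print("relevance frequency:", relevance_frequency(a, c))
```

In this hypothetical table the positive class holds only 40 of 200 documents, which is the kind of skew the abstract is concerned with; how the proposed metrics adjust for it is described in the full paper, not here.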

Keywords

Text classification; Class imbalance problem; Term evaluation; Machine learning

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 1, January 2020, pp. 31–47

DOI: https://doi.org/10.1017/S1351324919000317
Copyright: © Cambridge University Press 2019
