
Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Published online by Cambridge University Press: 27 November 2020

Nadine M. Neumann and Alexandre Plastino
Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil
e-mails: nadinemelloni@id.uff.br, plastino@ic.uff.br

Jony A. Pinto Junior
Departamento de Estatística, Universidade Federal Fluminense, Niterói, RJ, Brazil
e-mail: jarrais@id.uff.br

Alex A. Freitas
School of Computing, University of Kent, Canterbury, Kent, UK
e-mail: a.a.freitas@kent.ac.uk

Abstract

Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. This problem is so serious that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has been widely misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using the Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
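
As a concrete illustration of the kind of analysis the abstract describes, the Python sketch below compares two classifiers' per-dataset accuracies with a paired Student's t-test and a Wilcoxon signed-rank test, and complements the p-values with Cohen's d as an effect-size estimate and the post hoc power of the t-test. This is not the authors' code: the accuracy values are made up, and SciPy and statsmodels are assumed to be available.

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

# Hypothetical accuracies of classifiers A and B on the same ten datasets
# (illustrative numbers only, not results from the paper).
acc_a = np.array([0.91, 0.84, 0.78, 0.88, 0.93, 0.81, 0.86, 0.90, 0.79, 0.85])
acc_b = np.array([0.89, 0.83, 0.80, 0.85, 0.92, 0.78, 0.84, 0.88, 0.77, 0.84])
diff = acc_a - acc_b

# Paired Student's t-test and Wilcoxon signed-rank test on the paired scores.
t_stat, t_p = stats.ttest_rel(acc_a, acc_b)
w_stat, w_p = stats.wilcoxon(acc_a, acc_b)

# Cohen's d for paired samples: mean difference over its standard deviation.
d = diff.mean() / diff.std(ddof=1)

# Post hoc power of the paired t-test at this effect size and sample size.
power = TTestPower().power(effect_size=d, nobs=len(diff), alpha=0.05)

print(f"t-test:   t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"Wilcoxon: W = {w_stat:.1f}, p = {w_p:.4f}")
print(f"Cohen's d = {d:.3f}, power = {power:.3f}")

Reporting all three quantities together, rather than the p-value alone, is the practice the paper argues for: a small p-value with a negligible effect size or low power tells a very different story from a small p-value backed by both.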

Type: Research Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press


References

Barros, E. A. C. & Mazucheli, J. 2005. Um estudo sobre o tamanho e poder dos testes t-Student e Wilcoxon. Acta Scientiarum: Technology 27(1), 23–32.
Benavoli, A., Corani, G., Demšar, J. & Zaffalon, M. 2017. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research 18(1), 1–36.
Berben, L., Sereika, S. M. & Engberg, S. 2012. Effect size estimation: methods and examples. International Journal of Nursing Studies 49(8), 1039–1047.
Bertsimas, D. & Dunn, J. 2017. Optimal classification trees. Machine Learning 106(7), 1039–1082.
Breiman, L. 2001. Random forests. Machine Learning 45(1), 5–32.
Bussab, W. O. & Morettin, P. 2010. Estatística Básica, 6th edition. Editora Saraiva.
Cardoso, D. O., Gama, J. & França, F. M. 2017. Weightless neural networks for open set recognition. Machine Learning 106(9–10), 1547–1567.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Erlbaum.
Cousins, S. & Taylor, J. S. 2017. High-probability minimax probability machines. Machine Learning 106(6), 863–886.
Cover, T. & Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27.
Dheeru, D. & Taniskidou, E. K. 2017. UCI machine learning repository. http://archive.ics.uci.edu/ml.
du Plessis, M. C., Niu, G. & Sugiyama, M. 2017. Class-prior estimation for learning from positive and unlabeled data. Machine Learning 106(4), 463–492.
Fern, E. F. & Monroe, K. B. 1996. Effect-size estimates: issues and problems in interpretation. Journal of Consumer Research 23(2), 89–105.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Springer.
Fritz, C. O., Morris, P. E. & Richler, J. J. 2012. Effect size estimates: current use, calculations, and interpretation. Journal of Experimental Psychology: General 141(1), 2–18.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfahringer, B., Holmes, G. & Abdessalem, T. 2017. Adaptive random forests for evolving data stream classification. Machine Learning 106(9–10), 1469–1495.
Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E. & Tatham, R. L. 2009. Análise multivariada de dados. Bookman Editora.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J. & Scholkopf, B. 1998. Support vector machines. IEEE Intelligent Systems and Their Applications 13(4), 18–28.
Huang, K. H. & Lin, H. T. 2017. Cost-sensitive label embedding for multi-label classification. Machine Learning 106(9–10), 1725–1746.
Japkowicz, N. & Shah, M. 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press.
Júnior, P. R. M., de Souza, R. M., Werneck, R. d. O., Stein, B. V., Pazinato, D. V., de Almeida, W. R., Penatti, O. A., Torres, R. d. S. & Rocha, A. 2017. Nearest neighbors distance ratio open-set classifier. Machine Learning 106(3), 359–386.
Kim, D. & Oh, A. 2017. Hierarchical Dirichlet scaling process. Machine Learning 106(3), 387–418.
Kline, R. B. 2004. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association.
Kotłowski, W. & Dembczyński, K. 2017. Surrogate regret bounds for generalized classification performance metrics. Machine Learning 106(4), 549–572.
Krijthe, J. H. & Loog, M. 2017. Projected estimators for robust semi-supervised classification. Machine Learning 106(7), 993–1008.
Langley, P., Iba, W. & Thompson, K. 1992. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), California, AAAI Press, 223–228.
Mena, D., Montañés, E., Quevedo, J. R. & Del Coz, J. J. 2017. A family of admissible heuristics for A* to perform inference in probabilistic classifier chains. Machine Learning 106(1), 143–169.
Machine Learning Journal 2017. Machine Learning 106(1–12). https://link.springer.com/journal/10994/106/1
Nakagawa, S. & Cuthill, I. C. 2007. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews 82(4), 591–605.
Neumann, N. M., Plastino, A., Junior, J. A. P. & Freitas, A. A. 2018. Is p-value < 0.05 enough? Two case studies in classifiers evaluation (in Portuguese). In Anais do XV Encontro Nacional de Inteligência Artificial e Computacional, SBC, 94–103.
Osojnik, A., Panov, P. & Džeroski, S. 2017. Multi-label classification via multi-target regression on data streams. Machine Learning 106(6), 745–770.
Snyder, P. & Lawson, S. 1993. Evaluating results using corrected and uncorrected effect size estimates. The Journal of Experimental Education 61(4), 334–349.
Sullivan, G. M. & Feinn, R. 2012. Using effect size, or why the p-value is not enough. Journal of Graduate Medical Education 4(3), 279–282.
Suzumura, S., Ogawa, K., Sugiyama, M., Karasuyama, M. & Takeuchi, I. 2017. Homotopy continuation approaches for robust SV classification and regression. Machine Learning 106(7), 1009–1038.
Tomczak, M. & Tomczak, E. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences 21(1), 19–25.
Wasserstein, R. L. & Lazar, N. A. 2016. The ASA's statement on p-values: context, process, and purpose. The American Statistician 70, 129–133.
Witten, I. H., Frank, E., Hall, M. A. & Pal, C. J. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
Wu, Y. P. & Lin, H. T. 2017. Progressive random k-labelsets for cost-sensitive multi-label classification. Machine Learning 106(5), 671–694.
Xuan, J., Lu, J., Zhang, G., Da Xu, R. Y. & Luo, X. 2017. A Bayesian nonparametric model for multi-label learning. Machine Learning 106(11), 1787–1815.
Yu, F. & Zhang, M. L. 2017. Maximum margin partial label learning. Machine Learning 106(4), 573–593.
Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., Buntine, W., Hynes, M. & De Sterck, H. 2017. Efficient parameter learning of Bayesian network classifiers. Machine Learning 106(9–10), 1289–1329.
