
Bibliography

Published online by Cambridge University Press: 05 August 2011

Nathalie Japkowicz, American University, Washington DC
Mohak Shah, McGill University, Montréal

Type: Chapter
Book: Evaluating Learning Algorithms: A Classification Perspective, pp. 393–402
Publisher: Cambridge University Press
Print publication year: 2011
Chapter DOI: https://doi.org/10.1017/CBO9780511921803.014


References

Adams, N. M. and Hand, D. J. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32:1139–1147, 1999.
Aha, D. Generalizing from case studies: A case study. In Proceedings of the 9th International Workshop on Machine Learning (ICML '92), pp. 1–10. Morgan Kaufmann, San Mateo, CA, 1992.
Alaiz-Rodríguez, R. and Japkowicz, N. Assessing the impact of changing environments on classifier performance. In Proceedings of the 21st Canadian Conference on Artificial Intelligence (AI 2008). Springer, New York, 2008.
Alaiz-Rodríguez, R., Japkowicz, N., and Tischer, P. Visualizing classifier performance on different domains. In Proceedings of the 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '08), pp. 3–10. IEEE Computer Society, Washington, D.C., 2008.
Ali, S. and Smith, K. A. Kernel width selection for SVM classification: A meta-learning approach. International Journal of Data Warehousing and Mining, 1:78–97, 2006.
Alpaydın, E. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885–1892, 1999.
Andersson, A., Davidsson, P., and Linden, J. Measure-based classifier performance evaluation. Pattern Recognition Letters, 20:1165–1173, 1999.
Armstrong, J. S. Significance tests harm progress in forecasting. International Journal of Forecasting, 23:321–327, 2007.
Asuncion, A. and Newman, D. J. UCI machine learning repository. University of California, Irvine, School of Information and Computer Science, 2007. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bailey, T. L. and Elkan, C. Estimating the accuracy of learned concepts. In Proceedings of the 1993 International Joint Conference on Artificial Intelligence, pp. 895–900. Morgan Kaufmann, San Mateo, CA, 1993.
Bay, S. D., Kibler, D., Pazzani, M. J., and Smyth, P. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explorations, 2(2):81–85, December 2000.
Bellinger, C., Lalonde, J., Floyd, M. W., Mallur, V., Elkanzi, E., Ghazi, D., He, J., Mouttham, A., Scaiano, M., Wehbe, E., and Japkowicz, N. An evaluation of the value added by informative metrics. In Proceedings of the Fourth Workshop on Evaluation Methods for Machine Learning, 2009.
Bennett, E. M., Alpert, R., and Goldstein, A. C. Communications through limited response questioning. Public Opinion Quarterly, 18:303–308, 1954.
Berry, K. J. and Mielke, P. W. Jr. A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48:921–933, 1988.
Blum, A., Kalai, A., and Langford, J. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 203–208. Association for Computing Machinery, New York, 1999. doi: http://doi.acm.org/10.1145/307400.307439.
Bouckaert, R. R. Choosing between two learning algorithms based on calibrated tests. In Fawcett, T. and Mishra, N., editors, Proceedings of the 20th International Conference on Machine Learning. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Bouckaert, R. R. Estimating replicability of classifier learning experiments. In Brodley, C., editor, Proceedings of the 21st International Conference on Machine Learning. American Association for Artificial Intelligence, Menlo Park, CA, 2004.
Bousquet, O., Boucheron, S., and Lugosi, G. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pp. 169–207. Vol. 3176 of Springer Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, 2004.
Bradford, J. P., Kunz, C., Kohavi, R., Brunk, C., and Brodley, C. E. Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, pp. 131–136. Springer, Berlin, 1998.
Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall/CRC, 1984.
Brodley, C. E. Addressing the selective superiority problem: Automatic algorithm/model class selection. In Proceedings of the 10th International Conference on Machine Learning, pp. 17–24. Morgan Kaufmann, San Mateo, CA, 1993.
Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation: Structure and applications. 2005. URL: http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf.
Busemeyer, J. R. and Wang, Y. M. Model comparisons and model selections based on generalization test methodology. Journal of Mathematical Psychology, 44:171–189, 2000.
Byrt, T., Bishop, J., and Carlin, J. B. Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46:423–429, 1993.
Caruana, R. and Niculescu-Mizil, A. Data mining in metric space: An empirical analysis of supervised learning performance criteria. In Proceedings of KDD. Association for Computing Machinery, New York, 2004.
Chernick, M. R. Bootstrap Methods: A Guide for Practitioners and Researchers. 2nd ed. Wiley, New York, 2007.
Chow, S. L. Précis of Statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21:169–239, 1998.
Ciraco, M., Rogalewski, M., and Weiss, G. Improving classifier utility by altering the misclassification cost ratio. In Proceedings of the 1st International Workshop on Utility-Based Data Mining (UBDM '05), pp. 46–52. Association for Computing Machinery, New York, 2005.
Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.
Cohen, J. The earth is round (p < .05). American Psychologist, 49:997–1003, 1994.
Cohen, J. The earth is round (p < .05). In Harlow, L. L. and Mulaik, S. A., editors, What If There Were No Significance Tests? Lawrence Erlbaum, Mahwah, NJ, 1997.
Cohen, P. R. Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA, 1995.
Conover, W. J. Practical Nonparametric Statistics. 3rd ed. Wiley, New York, 1999.
Cortes, C. and Mohri, M. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, Vol. 16. MIT Press, Cambridge, MA, 2004.
Cortes, C. and Mohri, M. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems, Vol. 17. MIT Press, Cambridge, MA, 2005.
Davis, J. and Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the International Conference on Machine Learning, pp. 233–240. Association for Computing Machinery, New York, 2006.
Deeks, J. J. and Altman, D. G. Diagnostic tests 4: Likelihood ratios. British Medical Journal, 329:168–169, 2004.
Demartini, G. and Mizzaro, S. A classification of IR effectiveness metrics. In Proceedings of the European Conference on Information Retrieval, pp. 488–491. Vol. 3936 of Springer Lecture Notes. Springer, Berlin, 2006.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
Demšar, J. On the appropriateness of statistical tests in machine learning. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning. Association for Computing Machinery, New York, 2008.
Dice, L. R. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.
Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning, pp. 231–238. Morgan Kaufmann, San Mateo, CA, 2000.
Drummond, C. Machine learning as an experimental science (revised). In Proceedings of the AAAI'06 Workshop on Evaluation Methods for Machine Learning I. American Association for Artificial Intelligence, Menlo Park, CA, 2006.
Drummond, C. Finding a balance between anarchy and orthodoxy. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning. Association for Computing Machinery, New York, 2008.
Drummond, C. and Japkowicz, N. Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of Experimental and Theoretical Artificial Intelligence, 22(1):67–80, 2010.
Efron, B. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78:316–331, 1983.
Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
Elazmeh, W., Japkowicz, N., and Matwin, S. A framework for measuring classification difference with imbalance. In Proceedings of the 2006 European Conference on Machine Learning (ECML/PKDD 2006). Springer, Berlin, 2006.
Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, pp. 97–105. Morgan Kaufmann, San Mateo, CA, 1999.
Fawcett, T. ROC graphs: Notes and practical considerations for data mining researchers. Technical Note HPL-2003-4, Hewlett-Packard Laboratories, 2004.
Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874, 2006.
Fawcett, T. and Niculescu-Mizil, A. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007. doi: http://dx.doi.org/10.1007/s10994-007-5011-0.
Ferri, C., Flach, P. A., and Hernandez-Orallo, J. Improving the AUC of probabilistic estimation trees. In Proceedings of the 14th European Conference on Machine Learning, pp. 121–132. Springer, Berlin, 2003.
Ferri, C., Hernandez-Orallo, J., and Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30:27–38, 2009.
Fisher, R. A. Statistical Methods and Scientific Inference. 2nd ed. Hafner, New York, 1959.
Fisher, R. A. The Design of Experiments. 2nd ed. Hafner, New York, 1960.
Flach, P. A. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Proceedings of the 20th International Conference on Machine Learning, pp. 194–201. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Flach, P. A. and Wu, S. Repairing concavities in ROC curves. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pp. 702–707. Professional Book Center, 2005.
Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382, 1971.
Forman, G. A method for discovering the insignificance of one's best classifier and the unlearnability of a classification task. In Proceedings of the First International Workshop on Data Mining Lessons Learned (DMLL-2002), 2002.
Forster, M. R. Key concepts in model selection: Performance and generalizability. Journal of Mathematical Psychology, 44:205–231, 2000.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.
Fürnkranz, J. and Flach, P. A. ROC 'n' rule learning – Towards a better understanding of covering algorithms. Machine Learning, 58:39–77, 2005.
Ganti, V., Gehrke, J., Ramakrishnan, R., and Loh, W. Y. A framework for measuring differences in data characteristics. Journal of Computer and System Sciences, 64:542–578, 2002.
Gardner, M. and Altman, D. G. Confidence intervals rather than p values: Estimation rather than hypothesis testing. British Medical Journal, 292:746–750, 1986.
Gaudette, L. and Japkowicz, N. Evaluation methods for ordinal classification. In Proceedings of the 2009 Canadian Conference on Artificial Intelligence. Springer, New York, 2009.
Geng, L. and Hamilton, H. Choosing the right lens: Finding what is interesting in data mining. In Guillet, F. and Hamilton, H. J., editors, Quality Measures in Data Mining, pp. 3–24. Vol. 43 of Springer Studies in Computational Intelligence Series. Springer, Berlin, 2007.
Gigerenzer, G. Mindless statistics. Journal of Socio-Economics, 33:587–606, 2004.
Gill, J. and Meir, K. The insignificance of null hypothesis significance testing. Political Research Quarterly, pp. 647–674, 1999.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
Goodman, S. N. A comment on replication, p-values and evidence. Statistics in Medicine, 11:875–879, 1992.
Gosset, W. S. (pen name: Student). The probable error of a mean. Biometrika, 6:1–25, 1908.
Gwet, K. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment Series, 1:1–6, 2002a.
Gwet, K. Inter-rater reliability: Dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2:1–9, 2002b.
Hand, D. J. Classifier technology and the illusion of progress. Statistical Science, 21:1–15, 2006.
Hand, D. J. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77:103–123, 2009.
Hand, D. J. and Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45:171–186, 2001.
Hanley, J. A. and McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
Harlow, L. L. and Mulaik, S. A., editors. What If There Were No Significance Tests? Lawrence Erlbaum, Mahwah, NJ, 1997.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York, 2001.
He, J., Tan, A. H., Tan, C. L., and Sung, S. Y. On quantitative evaluation of clustering systems. In Wu, W. and Xiong, H., editors, Information Retrieval and Clustering. Kluwer Academic, Dordrecht, The Netherlands, 2002.
He, X. and Frey, E. C. The meaning and use of the volume under a three-class ROC surface (VUS). IEEE Transactions on Medical Imaging, 27:577–588, 2008.
Herbrich, R. Learning Kernel Classifiers. MIT Press, Cambridge, MA, 2002.
Hill, T. and Lewicki, P. Statistics: Methods and Applications. StatSoft, Tulsa, OK, 2007.
Hinton, P. Statistics Explained. Routledge, London, 1995.
Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
Holte, R. C. Very simple classification rules perform well on most commonly used data sets. Machine Learning, 11:63–91, 1993.
Hommel, G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75:383–386, 1988.
Hope, L. R. and Korb, K. B. A Bayesian metric for evaluating machine learning algorithms. In Australian Conference on Artificial Intelligence, pp. 991–997. Vol. 3399 of Springer Lecture Notes in Computer Science. Springer, New York, 2004.
Howell, D. C. Statistical Methods for Psychology. 5th ed. Duxbury Press, Thomson Learning, 2002.
Howell, D. C. Resampling Statistics: Randomization and the Bootstrap. On-line notes, 2007. URL: http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html.
Huang, J. and Ling, C. X. Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI '07), pp. 859–864, 2007.
Huang, J., Ling, C. X., Zhang, H., and Matwin, S. Proper model selection with significance test. In Proceedings of the European Conference on Machine Learning (ECML 2008), pp. 536–547. Springer, Berlin, 2008.
Hubbard, R. and Lindsay, R. M. Why p values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18:69–88, 2008.
Ioannidis, J. P. A. Why most published research findings are false. Public Library of Science Medicine, 2(8):e124, 2005.
Jaccard, P. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
Jain, A. K., Dubes, R. C., and Chen, C. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:628–633, 1987.
Japkowicz, N. Classifier evaluation: A need for better education and restructuring. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning, July 2008.
Japkowicz, N., Sanghi, P., and Tischer, P. A projection-based framework for classifier performance evaluation. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '08) – Part I, pp. 548–563. Springer-Verlag, Berlin, 2008.
Jensen, D. and Cohen, P. Multiple comparisons in induction algorithms. Machine Learning, 38:309–338, 2000.
Jin, H. and Lu, Y. Permutation test for non-inferiority of the linear to the optimal combination of multiple tests. Statistics and Probability Letters, 79:664–669, 2009.
Kendall, M. A new measure of rank correlation. Biometrika, 30:81–89, 1938.
Kibler, D. F. and Langley, P. Machine learning as an experimental science. In Proceedings of the Third European Working Session on Learning (EWSL), pp. 81–92. Pitman, New York, 1988.
Klement, W. Evaluating machine learning methods: Scored receiver operating characteristics (sROC) curves. Ph.D. thesis, SITE, University of Ottawa, Canada, May 2010.
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), pp. 1137–1143. Morgan Kaufmann, San Mateo, CA, 1995.
Kononenko, I. and Bratko, I. Information-based evaluation criterion for classifier's performance. Machine Learning, 6:67–80, 1991.
Kononenko, I. and Kukar, M. Machine Learning and Data Mining: Introduction to Principles and Algorithms. Horwood, Chichester, UK, 2007.
Kraemer, H. C. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44:461–472, 1979.
Kruskal, W. J. and Wallis, W. A. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47:583–621, 1952.
Kubat, M., Holte, R. C., and Matwin, S. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195–215, 1998.
Kukar, M. and Kononenko, I. Reliable classifications with machine learning. In Proceedings of the 13th European Conference on Machine Learning (ECML 2002), pp. 219–231. Springer, Berlin, 2002.
Kukar, M. Z. and Kononenko, I. Cost-sensitive learning with neural networks. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pp. 445–449. Wiley, New York, 1998.
Kuncheva, L. I., Whitaker, C. J., Shipp, C. A., and Duin, R. P. W. Limits on the majority vote accuracy in classifier fusion. Pattern Analysis and Applications, 6:22–31, 2003.
Kurtz, A. K. A research test of the Rorschach test. Personnel Psychology, 1:41–53, 1948.
Lachiche, N. and Flach, P. Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In Proceedings of the 20th International Conference on Machine Learning, pp. 416–423. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
LaLoudouana, D. and Tarare, M. B. Data set selection. In Proceedings of the Neural Information Processing System Workshop. MIT Press, Cambridge, MA, 2003.
Landgrebe, T., Paclík, P., Tax, D. J. M., Verzakov, S., and Duin, R. P. W. Cost-based classifier evaluation for imbalanced problems. In Proceedings of the 10th International Workshop on Structural and Syntactic Pattern Recognition and 5th International Workshop on Statistical Techniques in Pattern Recognition, pp. 762–770. Vol. 3138 of Springer Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2004.
Langford, J. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
Lavesson, N. and Davidsson, P. Towards application-specific evaluation metrics. In Proceedings of the Third Workshop on Evaluation Methods for Machine Learning (ICML 2008), 2008a.
Lavesson, N. and Davidsson, P. Generic methods for multi-criteria evaluation. In Proceedings of the Eighth SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, Philadelphia, 2008b.
Laviolette, F., Marchand, M., and Shah, M. Margin-sparsity trade-off for the set covering machine. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005), pp. 206–217. Vol. 3720 of Springer Lecture Notes in Artificial Intelligence. Springer, Berlin, 2005.
Laviolette, F., Marchand, M., Shah, M., and Shanian, S. Learning the set covering machine by bound minimization and margin-sparsity trade-off. Machine Learning, 78(1-2):275–301, 2010.
Lavrač, N., Flach, P., and Zupan, B. Rule evaluation measures: A unifying view. In Dzeroski, S. and Flach, P., editors, Ninth International Workshop on Inductive Logic Programming (ILP '99), pp. 174–185. Vol. 1634 of Springer Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1999.
Lebanon, G. and Lafferty, J. D. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), pp. 363–370. Morgan Kaufmann, San Mateo, CA, 2002.
Li, M. and Vitanyi, P. An Introduction to Kolmogorov Complexity and Its Applications. 2nd ed. Springer-Verlag, New York, 1997.
Lindley, D. V. and Scott, W. F. New Cambridge Statistical Tables. 2nd ed. Cambridge University Press, New York, 1984.
Ling, C. X., Huang, J., and Zhang, H. AUC: A statistically consistent and more discriminating measure than accuracy. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI '03), pp. 519–526. Morgan Kaufmann, San Mateo, CA, 2003.
Liu, X. Y. and Zhou, Z. H. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18:63–77, 2006.
Macskassy, S. A., Provost, F., and Rosset, S. Pointwise ROC confidence bounds: An empirical evaluation. In Proceedings of the Workshop on ROC Analysis in Machine Learning (ROCML-2005) at ICML '05, 2005.
Marchand, M. and Shah, M. PAC-Bayes learning of conjunctions and classification of gene-expression data. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems, Vol. 17, pp. 881–888. MIT Press, Cambridge, MA, 2005.
Marchand, M. and Shawe-Taylor, J. The set covering machine. Journal of Machine Learning Research, 3:723–746, 2002.
Margineantu, D. D. and Dietterich, T. G. Bootstrap methods for the cost-sensitive evaluation of classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 583–590. Morgan Kaufmann, San Mateo, CA, 2000.
Marrocco, C., Duin, R. P. W., and Tortorella, F. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition, 41:1961–1974, 2008.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. The DET curve in assessment of detection task performance. Eurospeech, 4:1895–1898, 1997.
Meehl, P. E. Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34:103–115, 1967.
Melnik, O., Vardi, Y., and Zhang, C. Mixed group ranks: Preference and confidence in classifier combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:973–981, 2004.
Micheals, R. J. and Boult, T. E. Efficient evaluation of classification and recognition systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 50–57. IEEE Computer Society, Washington, DC, 2001.
Mitchell, T. Machine Learning. McGraw-Hill, New York, 1997.
Murua, A. Upper bounds for error rates of linear combinations of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:591–602, 2002. doi: http://dx.doi.org/10.1109/34.1000235.
Nadeau, C. and Bengio, Y. Inference for the generalization error. Machine Learning, 52:239–281, 2003.
Nakhaeizadeh, G. and Schnabl, A. Development of multi-criteria metrics for evaluation of data mining algorithms. In Proceedings of KDD-97, pp. 37–42. American Association for Artificial Intelligence, Menlo Park, CA, 1997.
Nakhaeizadeh, G. and Schnabl, A. Towards the personalization of algorithms evaluation in data mining. In Proceedings of KDD-98, pp. 289–293. American Association for Artificial Intelligence, Menlo Park, CA, 1998.
Narasimhamurthy, A. M. Theoretical bounds of majority voting performance for a binary classification problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1988–1995, 2005. doi: http://dx.doi.org/10.1109/TPAMI.2005.249.
Narasimhamurthy, A. M. and Kuncheva, L. I. A framework for generating data to simulate changing environments. In Proceedings of the 2007 Conference on Artificial Intelligence and Applications, pp. 415–420. ACTA Press, 2007.
Neville, J. and Jensen, D. A bias/variance decomposition for models using collective inference. Machine Learning, 73:87–106, 2008. doi: http://dx.doi.org/10.1007/s10994-008-5066-6.
O'Brien, D. B., Gupta, M. R., and Gray, R. M. Cost-sensitive multi-class classification from probability estimates. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 712–719. Association for Computing Machinery, New York, 2008.
Provost, F. and Domingos, P. Tree induction for probability-based ranking. Machine Learning, 52:199–215, 2003. doi: http://dx.doi.org/10.1023/A:1024099825458.
Provost, F., Fawcett, T., and Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA, 1998.
Quenouille, M. Approximate tests of correlation in time series. Journal of the Royal Statistical Society Series B, 11:68–84, 1949.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. URL: http://www.R-project.org.
Reich, Y. and Barai, S. V. Evaluating machine learning models for engineering problems. Artificial Intelligence in Engineering, 13:257–272, 1999.
Rendell, L. and Cho, H. Empirical learning as a function of concept character. Machine Learning, 5:267–298, 1990.
ROCR. Germany, 2007. URL: http://rocr.bioinf.mpi-sb.mpg.de/.
Rosset, S. Model selection via the AUC. In Proceedings of the 21st International Conference on Machine Learning. Association for Computing Machinery, New York, 2004.
Rosset, S., Perlich, C., and Zadrozny, B. Ranking-based evaluation of regression models. Knowledge and Information Systems, 12(3):331–353, 2007.
Sahiner, B., Chan, H., and Hadjiiski, L. Classifier performance estimation under the constraint of a finite sample size: Resampling schemes applied to neural network classifiers. Neural Networks, 21:476–483, 2008.
Saitta, L. and Neri, F. Learning in the "real world." 1998 special issue: Applications of machine learning and the knowledge discovery process. Machine Learning, 30:133–163, 1998.
Salzberg, S. L. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–327, 1997.
Santos-Rodríguez, R., Guerrero-Curieses, A., Alaiz-Rodríguez, R., and Cid-Sueiro, J. Cost-sensitive learning based on Bregman divergences. Machine Learning, 76:271–285, 2009. doi: http://dx.doi.org/10.1007/s10994-009-5132-8.
Schmidt, F. L. Statistical significance testing and cumulative knowledge in psychology. Psychological Methods, 1:115–129, 1996.
Schouten, H. J. A. Measuring pairwise interobserver agreement when all subjects are judged by the same observers. Statistica Neerlandica, 36:45–61, 1982.
Scott, W. A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19:321–325, 1955.
Shah, M. Sample Compression, Margins and Generalization: Extensions to the Set Covering Machine. Ph.D. thesis, SITE, University of Ottawa, Canada, May 2006.
Shah, M. Sample compression bounds for decision trees. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 799–806. Association for Computing Machinery, New York, 2007. doi: http://doi.acm.org/10.1145/1273496.1273597.
Shah, M. Risk bounds for classifier evaluation: Possibilities and challenges. In Proceedings of the 3rd Workshop on Evaluation Methods for Machine Learning at ICML 2008, 2008.
Shah, M. and Shanian, S. Hold-out risk bounds for classifier performance evaluation. In Proceedings of the 4th Workshop on Evaluation Methods for Machine Learning at ICML '09, 2009.
Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. ROCR: Visualizing classifier performance in R. Bioinformatics, 21:3940–3941, 2005.
Smith-Miles, K. A. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1):article 6, 2008.
Soares, C. Is the UCI repository useful for data mining? In Pires, F. M. and Abreu, S., editors, Proceedings of the 11th Portuguese Conference on Artificial Intelligence (EPIA '03), pp. 209–223. Vol. 2902 of Springer Lecture Notes in Artificial Intelligence. Springer, Berlin, 2003.
Soares, C., Costa, J., and Brazdil, P. A simple and intuitive measure for multicriteria evaluation of classification algorithms. In Proceedings of the ECML 2000 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pp. 87–96. Springer, Berlin, 2000.
Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K., Pereira, F., Rasmussen, C. E., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., and Williamson, R. The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443–2466, 2007.
Spearman, C. The proof and measurement of association between two things. American Journal of Psychology, 15:72–101, 1904.
StatSoft Inc. Electronic Statistics Textbook. URL: http://www.statsoft.com/textbook/stathome.html.
Stocki, T., Japkowicz, N., Ungar, K., Hoffman, I., Yi, J., Li, G., and Siebes, A., editors. Proceedings of the Data Mining Contest, Eighth International Conference on Data Mining. IEEE Computer Society, Washington, D.C., 2008.
Vanderlooy, S. and Hüllermeier, E. A critical analysis of variants of the AUC. Machine Learning, 72(3):247–262, 2008. doi: http://dx.doi.org/10.1007/s10994-008-5070-x.
Vapnik, V. and Chapelle, O. Bounds on error expectation for support vector machines. Neural Computation, 12:2013–2036, 2000.
Webb, G. I. Discovering significant patterns. Machine Learning, 68:1–33, 2007. doi: http://dx.doi.org/10.1007/s10994-007-5006-x.
Weiss, S. M. and Kapouleas, I. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI '89), pp. 781–787. Morgan Kaufmann, San Mateo, CA, 1989.
Weiss, S. M. and Kulikowski, C. A. Computer Systems That Learn. Morgan Kaufmann, San Mateo, CA, 1991.
Wilcoxon, F. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.
Witten, I. H. and Frank, E. Weka 3: Data Mining Software in Java. 2005a. URL: http://www.cs.waikato.ac.nz/ml/weka/.
Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Mateo, CA, 2005b.
Wolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341–1390, 1996.
Wolpert, D. H. and Macready, W. G. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1:67–82, 1997.
Wu, S., Flach, P. A., and Ferri, C. An improved model selection heuristic for AUC. In Proceedings of the 18th European Conference on Machine Learning, Vol. 4701, pp. 478–487. Springer, Berlin, 2007.
Yan, L., Dodier, R., Mozer, M. C., and Wolniewicz, R. Optimizing classifier performance via the Wilcoxon–Mann–Whitney statistic. In Proceedings of the International Conference on Machine Learning (ICML), pp. 848–855. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Yousef, W. A., Wagner, R. F., and Loew, M. H. Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recognition Letters, 26:2600–2610, 2005.
Yousef, W. A., Wagner, R. F., and Loew, M. H. Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1809–1817, 2006.
Yu, C. H. Resampling methods: Concepts, applications, and justification. Practical Assessment, Research and Evaluation, 8(19), 2003.
Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 694–699. Association for Computing Machinery, New York, 2002. doi: http://doi.acm.org/10.1145/775047.775151.
Zadrozny, B., Langford, J., and Abe, N. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining, p. 435. IEEE Computer Society, Washington, D.C., 2003.
