
12 - Fundamental Limits in Model Selection for Modern Data Analysis

Published online by Cambridge University Press: 22 March 2021

Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

With rapid developments in hardware storage, precision instrument manufacturing, economic globalization, and related areas, data in various forms have become ubiquitous in human life. This enormous amount of data can be a double-edged sword. While it offers the possibility of modeling the world with higher fidelity and greater flexibility, improper modeling choices can lead to false discoveries, misleading conclusions, and poor predictions. Typical data-mining, machine-learning, and statistical-inference procedures learn from and make predictions on data by fitting parametric or non-parametric models. However, no single model is universally suitable for all datasets and goals. A crucial step in data analysis is therefore to consider a set of postulated candidate models and learning methods (the model class) and to select the most appropriate one. We provide an integrated discussion of the fundamental limits of inference and prediction based on model-selection principles in modern data analysis. In particular, we introduce two recent advances in model selection: one concerning a new information criterion and the other concerning modeling-procedure selection.
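The best-known instances of such selection rules are information criteria such as AIC and BIC, which penalize a model's fitted log-likelihood by its number of parameters. As a generic illustration only (a minimal Python sketch, not taken from the chapter; the toy data, candidate model class, and function names are all assumptions), the snippet below fits polynomial regression models of increasing order and selects the order minimizing each criterion:

import numpy as np

# Toy data: a quadratic signal plus Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-2.0, 2.0, n)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=n)

def gaussian_criteria(y, y_hat, k):
    """AIC and BIC for a least-squares fit with k mean parameters,
    using the Gaussian log-likelihood at the MLE noise variance."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)            # MLE of the noise variance
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    aic = -2.0 * log_lik + 2.0 * (k + 1)          # "+ 1" counts the variance
    bic = -2.0 * log_lik + np.log(n) * (k + 1)
    return aic, bic

# Candidate model class: polynomials of order 0 through 5.
scores = {}
for order in range(6):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    scores[order] = gaussian_criteria(y, y_hat, k=order + 1)

best_aic = min(scores, key=lambda m: scores[m][0])
best_bic = min(scores, key=lambda m: scores[m][1])
print(f"AIC selects order {best_aic}; BIC selects order {best_bic}")

On data like this, both criteria typically recover the true order 2. More generally, AIC's lighter penalty favors predictive accuracy while BIC's heavier penalty favors consistent identification of the true model, the very tension that motivates criteria beyond these two.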

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021

