Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches

Published online by Cambridge University Press: 04 January 2017

Jonathan Kropko*
Affiliation:
Woodrow Wilson Department of Politics, University of Virginia, 1540 Jefferson Park Avenue, Charlottesville, VA 22903
Ben Goodrich
Affiliation:
Department of Political Science, Columbia University, 420 W. 118th St., Mail Code 3320, New York, NY 10027. e-mail: bg2382@columbia.edu
Andrew Gelman
Affiliation:
Departments of Statistics and Political Science, Columbia University, 1255 Amsterdam Avenue, Room 1016, New York, NY 10027. e-mail: gelman@stat.columbia.edu
Jennifer Hill
Affiliation:
Department of Humanities and Social Sciences, New York University Steinhardt, 246 Greene Street, Room 804, New York, NY 10003. e-mail: jennifer.hill@nyu.edu
* e-mail: jkropko@virginia.edu (corresponding author)

Abstract

We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be “missing at random” (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.
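
To make the MAR condition concrete, the following is a minimal sketch and not the authors' simulation design. It assumes a hypothetical two-variable data set in which `x` is always observed and the probability that `y` is missing is a logistic function of `x` alone; the function name `make_mar`, the logistic form, and the sample size are all illustrative assumptions.

```python
import math
import random


def make_mar(rows, seed=0):
    """Return a copy of (x, y) rows with some y values set to None.

    Assumption for illustration: the probability that y is missing depends
    only on x, which is always observed. Given x, missingness is therefore
    independent of the unobserved y, which is the MAR condition.
    """
    rng = random.Random(seed)
    out = []
    for x, y in rows:
        p_missing = 1.0 / (1.0 + math.exp(-x))  # logistic in observed x only
        out.append((x, None if rng.random() < p_missing else y))
    return out


# Hypothetical complete data: y is correlated with x.
data_rng = random.Random(1)
complete = []
for _ in range(2000):
    x = data_rng.gauss(0.0, 1.0)
    y = 0.5 * x + data_rng.gauss(0.0, 1.0)
    complete.append((x, y))

incomplete = make_mar(complete)
n_missing = sum(1 for _, y in incomplete if y is None)
print(f"{n_missing} of {len(incomplete)} y values are missing")
```

Because missingness here is driven by `x`, and `x` is correlated with `y`, the observed `y` values are not a random subsample, yet the mechanism is still MAR since `x` is fully observed.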

Type
Research Article
Copyright
Copyright © The Author 2014. Published by Oxford University Press on behalf of the Society for Political Methodology 

Footnotes

Authors' note: An earlier version of this study was presented at the Annual Meeting of the Society for Political Methodology, Chapel Hill, NC, July 20, 2012. Replication code and data are available on the Political Analysis Dataverse, and the full citation to the replication material is included in the references. We thank Yu-sung Su, Yajuan Si, Sonia Torodova, Jingchen Liu, Michael Malecki, and two anonymous reviewers for their comments.

References

American National Election Studies (ANES; www.electionstudies.org). The ANES 2008 Time Series Study [data set]. Stanford University and the University of Michigan [producers].
Bernaards, Coen A., Belin, Thomas R., and Schafer, Joseph L. 2007. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 26(6): 1368–82.
Cranmer, Skyler J., and Gill, Jeff. 2013. We have to be discrete about this: A non-parametric imputation technique for missing categorical data. British Journal of Political Science 43(2): 425–49.
Cribari-Neto, Francisco, and Zeileis, Achim. 2010. Beta regression in R. Journal of Statistical Software 34(2): 1–24.
Demirtas, Hakan. 2010. A distance-based rounding strategy for post-imputation ordinal data. Journal of Applied Statistics 37(3): 489–500.
Dempster, Arthur P., Laird, Nan, and Rubin, Donald B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1): 1–38.
Gelman, Andrew, Jakulin, Aleks, Pittau, Maria Grazia, and Su, Yu-Sung. 2008. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2(4): 1360–83.
Gelman, Andrew, Su, Yu-Sung, Yajima, Masanao, Hill, Jennifer, Grazia Pittau, Maria, Kerman, Jouni, and Zheng, Tian. 2012. arm: Data analysis using regression and multilevel/hierarchical models. R package version 1.5-05. http://CRAN.R-project.org/package=arm.
Goodrich, Ben, Kropko, Jonathan, Gelman, Andrew, and Hill, Jennifer. 2012. mi: Iterative multiple imputation from conditional distributions. R package version 2.15.1.
Greenland, Sander, and Finkle, William D. 1995. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology 142(12): 1255–64.
Honaker, James, and King, Gary. 2010. What to do about missing values in time-series cross-section data. American Journal of Political Science 54(2): 561–81.
Honaker, James, King, Gary, and Blackwell, Matthew. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45(7): 1–47.
Honaker, James, King, Gary, and Blackwell, Matthew. 2012. Amelia II: A program for missing data. Software documentation, version 1.6.2. http://r.iq.harvard.edu/docs/amelia/amelia.pdf.
Horton, Nicholas J., Lipsitz, Stuart R., and Parzen, Michael. 2003. A potential for bias when rounding in multiple imputation. American Statistician 57(4): 229–32.
Kropko, Jonathan, Goodrich, Ben, Gelman, Andrew, and Hill, Jennifer. 2014. Replication data for: Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches. http://dx.doi.org/10.7910/DVN/24672. UNF:5:QuxE8nFhbW2JZT+OW9WzWw==. IQSS Dataverse Network [Distributor] V1 [Version].
Lee, Katherine J., and Carlin, John B. 2010. Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology 171(5): 624–32.
Lewandowski, Daniel, Kurowicka, Dorota, and Joe, Harry. 2010. Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis 100(9): 1989–2001.
Li, Fan, Yu, Yaming, and Rubin, Donald B. 2012. Imputing missing data by fully conditional models: Some cautionary examples and guidelines. Working paper. ftp.stat.duke.edu/WorkingPapers/11-24.pdf. Accessed 7 December 2012.
Royston, Patrick. 2005. Multiple imputation of missing values: Update. Stata Journal 5(2): 188–201.
Royston, Patrick. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445–74.
Royston, Patrick. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3): 466–77.
Rubin, Donald B. 1978. Multiple imputations in sample surveys. Proceedings of the Survey Research Methods Section of the American Statistical Association.
Rubin, Donald B. 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4(1): 87–94.
Rubin, Donald B. 1987. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons.
Rubin, Donald B., and Little, Roderick J. A. 2002. Statistical analysis with missing data. 2nd ed. New York: John Wiley and Sons.
Schafer, Joseph L. 1997. Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, Joseph L., and Olsen, Maren K. 1998. Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research 33(4): 545–71.
StataCorp. 2013. Stata 13 base reference manual. College Station, TX: Stata Press.
Su, Yu-Sung, Gelman, Andrew, Hill, Jennifer, and Yajima, Masanao. 2011. Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software 45(2): 1–31.
Therneau, Terry. 2012. survival: A package for survival analysis in S. R package version 2.36-14.
van Buuren, Stef. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 16(3): 219–42.
van Buuren, Stef. 2012. Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.
van Buuren, Stef, Boshuizen, Hendriek C., and Knook, D. L. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18(6): 681–94.
van Buuren, Stef, and Groothuis-Oudshoorn, Karin. 2011. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3): 1–67.
Venables, William N., and Ripley, Brian D. 2002. Modern applied statistics with S. 4th ed. New York: Springer.
Yu, L-M, Burton, Andrea, and Rivero-Arias, Oliver. 2007. Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research 16(3): 243–58.
Yuan, Yang C. 2013. Multiple imputation for missing data: Concepts and new development (Version 9.0). SAS Software Technical Papers.