# R-miss-tastic

A resource website on missing data

On this platform we attempt to give you an overview of the main references on missing values. We do not claim to gather all available references on the subject but rather to offer a peek into different fields of active research on handling missing values, allowing for an introductory reading as well as a starting point for further bibliographical research.

See here for a full (and uncommented) list of references.

Inspired by the CRAN Task View on Missing Data and a review by Imbert & Villa-Vialaneix on handling missing values (2018, written in French), we organized our selection of relevant references on missing values by topic.

In order to provide a more formal introduction to the problem of missing values and the existing methods to handle them (e.g. diagnose/describe the missingness or perform statistical analysis on the incomplete data), we introduce some fairly standard definitions and notations used in the remainder of this article.

• Let $$X=(X_1,\dots, X_p)$$ be a vector of $$p$$ random variables which can be continuous or categorical.

• We note $$x_{ij}$$ the observation of variable $$X_j$$ for an individual $$i\in\{1,\dots,n\}$$ and $$\mathbf{x}_i=(x_{i1},\dots,x_{ip})$$ the vector of observations of all $$p$$ variables $$X$$ for the individual $$i$$.

• The observations of the $$n$$ individuals are stacked by rows in a matrix $$\mathbf{X}\in\mathbb{R}^{n\times p}$$.

• The indicator matrix of missing values $$\mathbf{R}$$ is defined such that its values $$(r_{ij})_{\substack{i=1,\dots,n\\j=1,\dots,p}}$$ are given by: $$r_{ij} = \left\{\begin{array}{ll}1 & \text{ if } x_{ij} \text{ is observed}\\0 & \text{ otherwise}\end{array}\right. = \mathbb{1}_{\{x_{ij} \text{ is observed}\}}$$. The associated random variable is denoted by $$R$$.

• The observed and missing parts of $$X$$ are denoted respectively by $$X_{obs}$$ and $$X_{mis}$$.
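As a small illustration of this notation, the indicator matrix can be computed directly from a data matrix. The sketch below assumes that missing entries are encoded as NaN in a NumPy array; the toy values are made up for the example.

```python
import numpy as np

# Toy data matrix X with n = 3 individuals and p = 2 variables.
# Assumption: a missing entry is encoded as NaN.
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 6.0]])

# Indicator matrix R: r_ij = 1 if x_ij is observed, 0 otherwise.
R = (~np.isnan(X)).astype(int)

# Number of observed entries for each variable X_j.
n_obs = R.sum(axis=0)
```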

These general references and reviews are helpful to get started with the large field of missing values as they provide an introduction to the main concepts and methods or give an overview of the diversity of topics in statistical analysis related to missing values. They discuss different mechanisms that generated the missing values, necessary conditions for working consistently on the observed values alone and ways to impute, i.e. complete, the missing values to end up with complete datasets allowing the use of standard statistical analysis methods.

• Allison, P. D. Missing Data. Quantitative Applications in the Social Sciences. Thousand Oaks, CA, USA: Sage Publications, 2001. ISBN: 9780761916727.
• Buuren, S. van. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman and Hall/CRC, 2018.
• Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
• Enders, C. K. Applied Missing Data Analysis. Guilford Press, 2010, p. 401. ISBN: 9781606236390.
• Kim, J. K. and J. Shao. Statistical Methods for Handling Incomplete Data. Boca Raton, FL, USA: Chapman and Hall/CRC, 2013. ISBN: 9781482205077.
• Little, R. J. A. and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002, p. 408. ISBN: 0471183865.
• Molenberghs, G., G. Fitzmaurice, M. G. Kenward, et al. Handbook of Missing Data Methodology. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. New York, NY, USA: Chapman and Hall/CRC, 2014. ISBN: 9781439854624.
• Molenberghs, G. and M. G. Kenward. Missing Data in Clinical Studies. Chichester, West Sussex, UK: Wiley, 2007. ISBN: 9780470849811.
• O’Kelly, M. and B. Ratitch. Clinical Trials with Missing Data: A Guide for Practitioners. John Wiley & Sons, Ltd, 2014.
• Schafer, J. L. Analysis of Incomplete Multivariate Data. CRC Monographs on Statistics & Applied Probability. Boca Raton, FL, USA: Chapman and Hall/CRC, 1997. ISBN: 0412040611.
• Graham, J. W. Missing data analysis: making it work in the real world. In: Annual Review of Psychology 60 (2009), pp. 549-576.
• Kaiser, J. Dealing with missing values in data. In: Journal of Systems Integration 5.1 (2014), pp. 42-51.
• Pigott, T. D. A review of methods for missing data. In: Educational Research and Evaluation 7.4 (2001), pp. 353–383.
• Schafer, J. L. and J. W. Graham. Missing data: our view of the state of the art. In: Psychological Methods 7.2 (2002), pp. 147-177.
• Orchard, T. and M. A. Woodbury. A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. Ed. by L. M. Le Cam, J. Neyman and E. L. Scott. Vol. 1. University of California Press, 1972, pp. 697–715.

If you are rather new to the subject and wish to start with less formal and more application-based introductions, or if you are looking for general high-level advice on handling missing data, we suggest the following publications:

• National Research Council, U. The Prevention and Treatment of Missing Data in Clinical Trials. Washington (DC), USA: National Academies Press, 2010. ISBN: 9780309158145.
• Baraldi, A. N. and C. K. Enders. An introduction to modern missing data analysis. In: Journal of School Psychology 48.1 (2010), pp. 5-37.
• Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
• Dong, Y. and C. J. Peng. Principled missing data methods for researchers. In: SpringerPlus 2 (2013), p. 222.
• Horton, N. J. and K. P. Kleinman. Much Ado About Nothing - A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models. In: The American Statistician 61.1 (2007), pp. 79-90.
• Meng, X. L. You want me to analyze data I don’t have? Are you insane? In: Shanghai Archives of Psychiatry 24.5 (2012), pp. 287-301.
• Peugh, J. L. and C. K. Enders. Missing data in educational research: a review of reporting practices and suggestions for improvement. In: Review of Educational Research 74.4 (2004), pp. 525–556.

Furthermore you can have a look at the following statistical journals which regularly contain recent results related to handling missing data:

The first intuitive and probably most widely applied solution in data analysis to deal with missing values is to delete the partial observations and to work exclusively on the individuals with complete information. This has several drawbacks; among others, it introduces an estimation bias in most cases (more precisely, in cases where the missingness is not independent of the data). In order to reduce this bias one can reweight the complete observations to compensate for the deletion of incomplete individuals in the dataset. The weights are defined by inverse probabilities, for instance the inverse of the probability for each individual of being fully observed. This method is known as inverse probability weighting and is described in detail in the publications below. We split the references into two parts: handling missing values in survey data and performing causal inference in the presence of missing values, both requiring the use of weighting methods.
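The reweighting idea can be sketched on a toy example. This is only a minimal illustration under stated assumptions: the data are made up, the function name `ipw_mean` is hypothetical, the probability of being observed is estimated by the observed fraction within each group, and the weights are normalized to sum to one (one common variant among several).

```python
import numpy as np

def ipw_mean(y, observed, prob_observed):
    """Weighted mean of the observed y, with weights 1 / P(observed)."""
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    weights = 1.0 / np.asarray(prob_observed, dtype=float)[observed]
    return np.sum(weights * y[observed]) / np.sum(weights)

# Hypothetical data: two groups, and group 1 is observed less often.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0])
observed = np.array([1, 1, 1, 1, 1, 0, 0, 1], dtype=bool)

# Assumed model for the missingness: the probability of being observed
# depends only on the group and is estimated by the observed fraction.
prob = np.array([observed[group == g].mean() for g in group])

# Deleting incomplete individuals over-represents group 0 ...
complete_case_mean = y[observed].mean()
# ... whereas reweighting restores the balance between the two groups.
weighted_mean = ipw_mean(y, observed, prob)
```

In this toy setting the mean over all eight individuals is 2; the complete-case mean falls below it because group 1 is under-represented, while the weighted mean recovers it.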

For survey data analysis

Such weighting methods are widely used on survey data in order to correct for unbalanced sampling fractions by balancing the empirical distributions of the observed covariates to recover the structure of the target population.

• Buck, S. F. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. In: Journal of the Royal Statistical Society, Series B 22 (1960), pp. 302-306.
• Carpenter, J. R., M. G. Kenward and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
• Fitzmaurice, G. M., G. Molenberghs and S. R. Lipsitz. Regression Models for Longitudinal Binary Responses with Informative Drop-Outs. In: Journal of the Royal Statistical Society. Series B (Methodological) 57.4 (1995), pp. 691–704.
• Gelman, A., G. King and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
• Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
• Preisser, J. S., K. K. Lohman and P. J. Rathouz. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. In: Statistics in Medicine 21.20 (2002), pp. 3035–3054.
• Robins, J. M., A. Rotnitzky and L. P. Zhao. Estimation of Regression Coefficients When Some Regressors are not Always Observed. In: Journal of the American Statistical Association 89.427 (1994), pp. 846-866.
• Rubin, D. B. Formalizing subjective notions about the effect of nonrespondents in sample surveys. In: Journal of the American Statistical Association 72.359 (1977), pp. 538-543.
• Seaman, S. R. and I. R. White. Review of inverse probability weighting for dealing with missing data. In: Statistical Methods in Medical Research 22.3 (2011), pp. 278-295.
• Vansteelandt, S., J. Carpenter and M. G. Kenward. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. In: Methodology – European Journal of Research Methods for the Behavioral and Social Sciences 6.1 (2010), pp. 37–48.

For causal inference

Inverse probability weighting is also used in causal inference: a bias is induced by the presence of confounders, i.e. variables which are associated with both the covariates and the outcome. Hence, if the goal is to estimate causal relationships between covariates and outcome, it is necessary to account for the potential effect of confounders (a selection bias) on the result of causal inference.
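To make the role of the weights concrete, here is a minimal sketch with a single binary confounder and a binary covariate of interest (called "treatment" in the code). Everything in it is an assumption for illustration: the data are made up, the outcome is constructed so that the true effect equals 1, and the propensity (probability of treatment given the confounder) is estimated by simple stratum frequencies.

```python
import numpy as np

# Hypothetical data: the confounder raises both the chance of treatment
# and the outcome, so a naive comparison is biased.
confounder = np.array([0, 0, 0, 0, 1, 1, 1, 1])
treatment = np.array([1, 0, 0, 0, 1, 1, 1, 0])
outcome = 2.0 * confounder + 1.0 * treatment  # true treatment effect: 1

# Naive estimate: difference of group means, ignoring the confounder.
naive = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# Propensity P(treatment = 1 | confounder), estimated per stratum.
propensity = np.array(
    [treatment[confounder == c].mean() for c in confounder]
)

# Inverse-probability-weighted estimate of the average treatment effect.
treated_term = np.mean(treatment * outcome / propensity)
control_term = np.mean((1 - treatment) * outcome / (1 - propensity))
ipw_effect = treated_term - control_term
```

On this toy data the naive difference is 2, while the weighted estimate recovers the effect of 1 that was built into the outcome.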

• Bang, H. and J. M. Robins. Doubly robust estimation in missing data and causal inference models. In: Biometrics 61.4 (2005), pp. 962-973.
• Blake, H. A., C. Leyrat, K. Mansfield, et al. Propensity scores using missingness pattern information: a practical guide. In: arXiv preprint (2019). arXiv: 1901.03981 [stat.ME].
• Ding, P. and F. Li. Causal Inference: A Missing Data Perspective. In: Statistical Science 33.2 (2018), pp. 214–237.
• Hogan, J. W. and T. Lancaster. Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. In: Statistical Methods in Medical Research 13.1 (2004), pp. 17-48.
• Wal, W. M. van der and R. B. Geskus. ipw: an R package for inverse probability weighting. In: Journal of Statistical Software 43.13 (2011).
• Yang, S., L. Wang and P. Ding. Identification and estimation of causal effects with confounders subject to instrumental missingness. In: Statistics Methodology Repository (2017).
• Kallus, N., X. Mao and M. Udell. Causal Inference with Noisy and Missing Covariates via Matrix Factorization. In: Advances in Neural Information Processing Systems. 2018. eprint: 1806.00811.

Let $$x_i$$ be an observation with missing values, e.g. each entry of $$x_i$$ could be the temperature on a certain day at a given place, and unfortunately for some days the temperature was not measured. An intuitive idea to replace this missing information could be: take other observations $$\{x_j\}_j$$ which are similar to $$x_i$$ on the observed values and use this information to fill in the gaps. This idea of taking observed values from neighbours or donors based on some similarity measure is implemented in the so-called hot-deck and k-nearest-neighbors (kNN) approaches.
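The donor idea can be sketched as follows. This is a simplified illustration rather than the procedure of any particular package: the function name `knn_impute` is hypothetical, similarity is measured by a root-mean-square distance over the entries observed in both rows, and a missing entry is replaced by the mean of the `k` closest donors that have the value observed.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaN entries with the mean of the k most similar donor rows."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    X_imputed = X.copy()
    n = X.shape[0]
    for i in range(n):
        missing_cols = np.where(~observed[i])[0]
        if missing_cols.size == 0:
            continue
        # Distance from row i to every other row, computed only on the
        # variables that are observed in both rows.
        dists = np.full(n, np.inf)
        for j in range(n):
            common = observed[i] & observed[j]
            if j != i and common.any():
                diff = X[i, common] - X[j, common]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in missing_cols:
            # Donors: closest rows with a finite distance and an
            # observed value in this column.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and observed[j, col]][:k]
            if donors:
                X_imputed[i, col] = X[donors, col].mean()
    return X_imputed

# Hypothetical temperatures: rows are places, columns are days.
temperatures = np.array([[10.0, 11.0, np.nan],
                         [10.0, 12.0, 13.0],
                         [11.0, 11.0, 15.0],
                         [30.0, 31.0, 32.0]])
completed = knn_impute(temperatures, k=2)
```

Here the first place is closest to the second and third places on the two observed days, so its missing value is filled with the mean of their third-day temperatures.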

• Andridge, R. and R. J. A. Little. A review of hot deck imputation for survey non-response. In: International Statistical Review 78.1 (2010), pp. 40-64.
• Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
• Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
• Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
• Rao, J. N. K. and J. Shao. Jackknife variance estimation with survey data under hot deck imputation. In: Biometrika 79.4 (1992), pp. 811-822.
• Reilly, M. and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. In: Statistics in Medicine 16.1-3 (1997), pp. 5-19.
• Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016). Forthcoming.

Likelihood-based approaches in the presence of missing values are designed for statistical inference from incomplete data. More precisely, if the missingness mechanism is ignorable (in a certain sense that is explained in the Missing values mechanisms section), then one can attempt to infer the model parameters by maximizing the likelihood on the observed values. When the mechanism cannot be ignored, a specific model for it needs to be assumed. The main algorithm available for performing maximum likelihood (ML) estimation with missing values is the Expectation Maximization (EM) algorithm. This algorithm requires knowledge of the joint distribution of $$X = (X_{obs}, X_{mis})$$ and its implementation is not straightforward since it involves integrals which cannot always be computed easily. Once the model parameters are estimated, one can impute the missing values using this estimated information on the data model.
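For one case where the integrals are available in closed form, the EM iterations can be written compactly. The sketch below assumes a multivariate normal model for $$X$$, an ignorable mechanism, and NaN-encoded missing entries; the function name `em_mvn` and the fixed number of iterations are choices made for this illustration, not a reference implementation.

```python
import numpy as np

def em_mvn(X, n_iter=50):
    """EM estimates of the mean and covariance of a multivariate normal
    from data with NaN-encoded missing values (ignorable mechanism)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    observed = ~np.isnan(X)
    # Start from the observed means and a diagonal covariance.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    X_hat = np.where(observed, X, mu)
    for _ in range(n_iter):
        # E-step: conditional mean of the missing part of each row and
        # accumulated conditional covariance of the missing parts.
        X_hat = X.copy()
        C = np.zeros((p, p))
        for i in range(n):
            o, m = observed[i], ~observed[i]
            if not m.any():
                continue
            s_mm = sigma[np.ix_(m, m)]
            if o.any():
                s_oo = sigma[np.ix_(o, o)]
                s_om = sigma[np.ix_(o, m)]
                coef = np.linalg.solve(s_oo, s_om).T  # Sigma_mo Sigma_oo^-1
                X_hat[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
                C[np.ix_(m, m)] += s_mm - coef @ s_om
            else:
                X_hat[i, m] = mu[m]
                C[np.ix_(m, m)] += s_mm
        # M-step: update the parameters from the expected statistics.
        mu = X_hat.mean(axis=0)
        centered = X_hat - mu
        sigma = (centered.T @ centered + C) / n
    # X_hat holds the conditional-mean imputation from the last E-step.
    return mu, sigma, X_hat
```

The third return value illustrates the last sentence of the paragraph above: once the parameters are estimated, the conditional expectation of the missing entries given the observed ones can serve as an imputation.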

• McLachlan, G. J. and T. Krishnan. The EM Algorithm and Extensions. Wiley series in probability and statistics. Hoboken, NJ, USA: Wiley, 2008. ISBN: 9780471201700.
• Collins, L. M., J. L. Schafer and C.-M. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
• Enders, C. K. A primer on maximum likelihood algorithms available for use with missing data. In: Structural Equation Modeling 8.1 (2001), pp. 128-141.
• Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
• Ibrahim, J. G., S. R. Lipsitz and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
• Ibrahim, J. G., M. Chen and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
• Jones, M. P. Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. In: Journal of the American Statistical Association 91.433 (1996), pp. 222-230.
• Little, R. J. A. Regression with missing X’s: a review. In: Journal of the American Statistical Association 87.420 (1992), pp. 1227-1237.
• Louis, T. A. Finding the Observed Information Matrix when Using the EM Algorithm. In: Journal of the Royal Statistical Society. Series B (Methodological) 44.2 (1982), pp. 226–233.
• Meng, X. L. and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: a general framework. In: Biometrika 80.2 (1993), pp. 267-278.
• Meng, X. L. and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. In: Journal of the American Statistical Association 86.416 (1991), pp. 899-909.
• Rosseel, Y. lavaan: an R package for structural equation modeling. In: Journal of Statistical Software 48.2 (2012).
• Rubin, D. B. Inference and missing data. In: Biometrika 63.3 (1976), pp. 581-592.
• Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
• Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
• Tchetgen Tchetgen, E. J., L. Wang and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
• Zhou, Y., R. J. A. Little and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.

In the previously mentioned EM algorithm there is in fact an implicit step called imputation: imputing a missing value means replacing it with a plausible one. The definition of plausibility is not stated explicitly but can be deduced from the method used to fill in the gaps; for instance, one could choose to replace all missing values of a certain variable $$X_j$$ by the average observed value $$\frac{1}{n_{obs,j}}\sum_{i} x_{ij}\mathbb{1}_{\{x_{ij} \text{ is observed}\}}$$, where $$n_{obs,j} = \sum_{i} \mathbb{1}_{\{x_{ij} \text{ is observed}\}}$$. The interest of imputation is manifold: (1) it allows the use of all information in the sample (instead of deleting incomplete observations, which decreases the power of the statistical analysis), (2) if there is sufficient data, i.e. sufficient observations, then the imputation can be very accurate, which supports the quality of subsequent statistical analyses, and (3) the imputed dataset is a complete dataset and one can apply standard statistical inference methods. The last point, however, has to be treated with caution since it implies that the statistical analysis no longer makes any distinction between observed values and imputed values. We will come back to this issue in the next section on multiple imputation.
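The mean imputation formula above translates almost literally into code. The sketch assumes NaN-encoded missing entries and at least one observed value per variable; the function name `mean_impute` is chosen for this example only.

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry of variable X_j by its observed mean."""
    X = np.asarray(X, dtype=float)
    R = ~np.isnan(X)                      # indicator of observed entries
    n_obs = R.sum(axis=0)                 # n_{obs,j}
    # Sum of the observed values per variable, divided by n_{obs,j}.
    col_means = np.where(R, X, 0.0).sum(axis=0) / n_obs
    return np.where(R, X, col_means)

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_completed = mean_impute(X)
```

Observed entries are left untouched; only the NaN positions receive the per-variable mean.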

• Audigier, V., F. Husson and J. Josse. A principal component method to impute missing values for mixed data. In: Advances in Data Analysis and Classification 10.1 (2016), pp. 5-26.
• Cranmer, S. J. and J. Gill. We have to be discrete about this: a non-parametric imputation technique for missing categorical data. In: British Journal of Political Science 43 (2012), pp. 425-449.
• Crookston, N. L. and A. O. Finley. yaImpute: an R package for kNN imputation. In: Journal of Statistical Software 23 (2008), p. 10.
• Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
• Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
• Fellegi, I. P. and D. Holt. A systematic approach to automatic edit and imputation. In: Journal of the American Statistical Association 71.353 (1976), pp. 17-35.
• Ferrari, P. A., P. Annoni, A. Barbiero, et al. An imputation method for categorical variables with application to nonlinear principal component analysis. In: Computational Statistics & Data Analysis 55.7 (2011), pp. 2410-2420.
• Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
• Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
• Husson, F. and J. Josse. Handling missing values in multiple factor analysis. In: Food Quality and Preference 30 (2013), pp. 77-85.
• Ilin, A. and T. Raiko. Practical approaches to Principal Component Analysis in the presence of missing values. In: Journal of Machine Learning Research 11 (2010), pp. 1957-2000.
• Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
• Josse, J., M. Chavent, B. Liquet, et al. Handling missing values with regularized iterative multiple correspondence analysis. In: Journal of Classification 29.1 (2012), pp. 91-116.
• Josse, J., F. Husson and J. Pagès. Gestion des données manquantes en Analyse en Composantes Principales. In: Journal de la Société Française de Statistique 150.2 (2009), pp. 28-51.
• Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
• Kohn, R. and C. F. Ansley. Estimation, prediction, and interpolation for ARIMA models with missing data. In: Journal of the American Statistical Association 81.395 (1986), pp. 751-761.
• Kowarik, A. and M. Templ. Imputation with the R Package VIM. In: Journal of Statistical Software 74.7 (2016), pp. 1-16.
• Moritz, S. and T. Bartz-Beielstein. imputeTS: time series missing value imputation in R. In: The R Journal 9.1 (2017), pp. 207-218.
• Stacklies, W., H. Redestig, M. Scholz, et al. pcaMethods – a bioconductor package providing PCA methods for incomplete data. In: Bioinformatics 23.9 (2007), pp. 1164-1167.
• Troyanskaya, O., M. Cantor, G. Sherlock, et al. Missing value estimation methods for DNA microarrays. In: Bioinformatics 17.6 (2001), pp. 520-525.
• Unnebrink, K. and J. Windeler. Intention-to-treat: methods for dealing with missing values in clinical trials of progressively deteriorating diseases. In: Statistics in Medicine 20.24 (2001), pp. 3931-3946.
• Verbanck, M., J. Josse and F. Husson. Regularised PCA to denoise and visualise data. In: Statistics and Computing 25.2 (2015), pp. 471-486.
• Zhang, H., P. Xie and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
• Zhang, S. Nearest neighbor selection for iterative kNN imputation. In: Journal of Systems and Software 85.11 (2012), pp. 2541-2552.
• Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
• Moritz, S., A. Sardá, T. Bartz-Beielstein, et al. Comparison of different methods for univariate time series imputation in R. Preprint arXiv 1510.03924. 2015.

A major drawback of single imputation, i.e. where every missing value is replaced by a single most plausible value, is the underestimation of the overall variance of the data and of the inferred parameters. Indeed, by replacing every missing value by a given plausible one and by applying generic statistical methods to the completed dataset, one no longer distinguishes between initially observed and unobserved data. Therefore the variability due to the uncertainty of the missing values is not reflected in subsequent statistical analyses, which treat the dataset as if it had been fully observed from the beginning. A conceptually simple workaround for this problem is multiple imputation: instead of generating a single complete dataset by a given imputation method, one imputes every missing value by several possible values. Statistical analysis is then applied to each of the imputed datasets and the resulting estimates are aggregated and used to estimate the sample variance and the variance due to the uncertainty in the missing values.
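The aggregation step is commonly carried out with the combining rules described in Rubin (1987), listed below. The following sketch assumes a scalar parameter, with one point estimate and one estimated variance per imputed dataset; the function name `pool_rubin` and the input numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M completed-data estimates of a scalar parameter.

    estimates: point estimate from each imputed dataset
    variances: estimated variance of that estimate in each dataset
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                # pooled point estimate
    within = u.mean()               # average sampling variance
    between = q.var(ddof=1)         # variance due to the missing values
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, within, between, total

# Hypothetical results from M = 3 imputed datasets.
q_bar, within, between, total = pool_rubin([1.0, 2.0, 3.0],
                                           [0.5, 0.5, 0.5])
```

The between-imputation component is exactly what a single imputation cannot provide, which is why single imputation understates the total variance.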

• Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
• Rubin, D. B. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ, USA: Wiley, 1987. ISBN: 9780471655740.
• Abayomi, K., A. Gelman and M. Levy. Diagnostics for multivariate imputations. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 57.3 (2008), pp. 273-291.
• Audigier, V., F. Husson and J. Josse. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. In: Statistics and Computing 27.2 (2016), pp. 1-18. eprint: 1505.08116.
• Audigier, V., F. Husson and J. Josse. Multiple imputation for continuous variables using a Bayesian principal component analysis. In: Journal of Statistical Computation and Simulation 86.11 (2015), pp. 2140-2156.
• Buuren, S. van. Multiple imputation of discrete and continuous data by fully conditional specification. In: Statistical Methods in Medical Research 16 (2007), pp. 219-242.
• Buuren, S. van, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, et al. Fully conditional specification in multivariate imputation. In: Journal of Statistical Computation and Simulation 76.12 (2006), pp. 1049-1064.
• Buuren, S. van and K. Groothuis-Oudshoorn. MICE: multivariate imputation by chained equations in R. In: Journal of Statistical Software 45 (2011), p. 3. eprint: NIHMS150003.
• Carpenter, J. R., M. G. Kenward and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
• Collins, L. M., J. L. Schafer and C.-M. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
• Fay, R. E. Alternative paradigms for the analysis of imputed survey data. In: Journal of the American Statistical Association 91.434 (1996), pp. 490-498.
• Gelman, A., G. King and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
• Gelman, A., I. van Mechelen, G. Verbeke, et al. Multiple Imputation for Model Checking: Completed-Data Plots with Missing and Latent Data. In: Biometrics 61.1 (2005), pp. 74–85.
• Graham, J. W., A. E. Olchowski and T. E. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. In: Prevention Science 8.3 (2007), pp. 206-213.
• Honaker, J., G. King and M. Blackwell. Amelia II: a program for missing data. In: Journal of Statistical Software 45.7 (2011). eprint: arXiv:1501.0228.
• Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
• Josse, J. and F. Husson. missMDA: a package for handling missing values in multivariate data analysis. In: Journal of Statistical Software 70.1 (2016), pp. 1-31.
• Josse, J. and F. Husson. Handling missing values in exploratory multivariate data analysis methods. In: Journal de la Société Française de Statistique 153.2 (2012), pp. 79-99.
• Josse, J., J. Pagès and F. Husson. Multiple imputation in principal component analysis. In: Advances in Data Analysis and Classification 5.3 (2011), pp. 231-246.
• Kropko, J., B. Goodrich, A. Gelman, et al. Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches. In: Political Analysis 22.4 (2014), pp. 497–519.
• Murray, J. S. and J. P. Reiter. Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence. In: Journal of the American Statistical Association 111.516 (2016), pp. 1466-1479.
• Robins, J. M. and N. Wang. Inference for imputation estimators. In: Biometrika 87.1 (2000), pp. 113-124.
• Rubin, D. B. Multiple imputation after 18+ years. In: Journal of the American Statistical Association 91.434 (1996), pp. 473-489.
• Schafer, J. L. Multiple imputation: a primer. In: Statistical Methods in Medical Research 8.1 (1999), pp. 3-15.
• Schafer, J. L. and M. K. Olsen. Multiple Imputation for multivariate missing-data problems: a data analyst’s perspective. In: Multivariate Behavioral Research 33.4 (1998), pp. 545-571.
• Stuart, E. A., M. Azur, C. Frangakis, et al. Multiple imputation with large data sets: a case study of the children’s mental health initiative. In: American Journal of Epidemiology 169.9 (2009), pp. 1133-1139.
• Su, Y. S., A. Gelman, J. Hill, et al. Multiple imputation with diagnostics (mi) in R: opening windows into the black box. In: Journal of Statistical Software 45 (2011), p. 2.
• Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016). Forthcoming.
• Wang, N. and J. M. Robins. Large-sample theory for parametric multiple imputation procedures. In: Biometrika 85.4 (1998), pp. 935–948.
• Xie, X. and X. L. Meng. Dissecting multiple imputation from a multi-phase inference perspective: what happens when God’s, imputer’s and analyst’s models are uncongenial? In: Statistica Sinica 27.4 (2017), pp. 1485–1594.
• Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.

Since the field of machine learning depends on the availability of (good) training data, it necessarily faces the issue of missing data in most real-world applications. Hence increasing attention has been paid to how to handle missing data, in the features and the output, in order to learn accurately from the data.

Trees and forests

Decision trees are models based on the recursive application of elementary rules. This architecture grants them a variety of simple options to deal with missing values, without requiring prior imputation. A popular class of tree-based models, random trees (or more generally random forests), allows data analyses such as causal inference in the presence of missing values without first having to impute them.

• Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
• Hothorn, T., K. Hornik and A. Zeileis. Unbiased Recursive Partitioning: A Conditional Inference Framework. In: Journal of Computational and Graphical Statistics 15.3 (2006), pp. 651-674.
• Kapelner, A. and J. Bleich. Prediction with missing data via Bayesian additive regression trees. In: Canadian Journal of Statistics 43.2 (2015), pp. 224-239.
• Rahman, G. and Z. Islam. Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. In: Knowledge-Based Systems 53 (2013), pp. 51–65.
• Stekhoven, D. J. and P. Bühlmann. Missforest-non-parametric missing value imputation for mixed-type data. In: Bioinformatics 28.1 (2012), pp. 112-118. eprint: 1105.0828.
• Strobl, C., A. L. Boulesteix and T. Augustin. Unbiased split selection for classification trees based on the Gini Index. In: Computational Statistics & Data Analysis 52.1 (2007), pp. 483-501.
• Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
• Twala, B. E. T. H., M. C. Jones and D. J. Hand. Good methods for coping with missing data in decision trees. In: Pattern Recognition Letters 29.7 (2008), pp. 950-956.
• Chen, T. and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (Aug. 13, 2016-Aug. 17, 2016). New York, NY, USA: ACM, 2016, pp. 785-794. ISBN: 0450342322.
• Rieger, A., T. Hothorn and C. Strobl. Random forests with missing values in the covariates. Tech. rep. 79. University of Munich, Department of Statistics, 2010.

Deep Learning

The advance and success of (deep) neural networks in many research and application areas such as computer vision and natural language processing has also renewed interest in the problem of handling missing values. Indeed, the question of training neural networks on incomplete data was considered even before the latest rise of deep learning and is considered essential due to the impact of missingness on the feasibility and quality of various learning problems.

• Sharpe, P. K. and R. J. Solly. Dealing with missing values in neural network-based diagnostic systems. In: Neural Computing & Applications 3.2 (1995), pp. 73-77.
• Śmieja, M., Ł. Struski, J. Tabor, et al. Processing of missing data by neural networks. In: Computing Research Repository abs/1805.07405 (2018). eprint: 1805.07405.
• Sovilj, D., E. Eirola, Y. Miche, et al. Extreme learning machine for missing data using multiple imputations. In: Neurocomputing 174.A (2016), pp. 220-231.
• Zhang, H., P. Xie and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
• Bengio, Y. and F. Gingras. Recurrent neural networks for missing or asynchronous data. In: Proceedings of the 8th International Conference on Neural Information Processing Systems. (Nov. 27, 1995-Dec. 02, 1995). Cambridge, MA, USA: MIT Press, 1995, pp. 395-401.
• Biessmann, F., D. Salinas, S. Schelter, et al. “Deep” Learning for Missing Value Imputation in Tables with Non-Numerical Data. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18. Torino, Italy: ACM, 2018, pp. 2017–2025. ISBN: 978-1-4503-6014-2.
• Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.
• Goodfellow, I., M. Mirza, A. Courville, et al. Multi-Prediction Deep Boltzmann Machines. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. (Dec. 05, 2013-Dec. 10, 2013). Ed. by C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger. Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 548–556.
• Nowicki, R. K., R. Scherer and L. Rutkowski. Novel rough neural network for classification with missing data. In: 21st International Conference on Methods and Models in Automation and Robotics (MMAR). (Aug. 29, 2016-Sep. 01, 2016). IEEE, 2016, pp. 820–825.
• Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
• Yoon, J., J. Jordon and M. van der Schaar. GAIN: Missing Data Imputation using Generative Adversarial Nets. In: Proceedings of the 35th International Conference on Machine Learning. (Jul. 10, 2018-Jul. 15, 2018). Ed. by J. Dy and A. Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, 2018, pp. 5689–5698.

As mentioned in the above sections, working with missing values requires assumptions on the mechanism that generates them, also called the response mechanism. Broadly speaking, these assumptions indicate how strongly the missingness is related to the data itself. They affect all further steps of the data analysis, since some types of missingness induce a bias in the results, and are therefore crucial for valid analyses of data in the presence of missing values.

More formally, the response mechanism is defined as the conditional distribution of $$R$$ given $$X$$, $$f(R|X)$$. This distribution can depend on some parameter $$\psi$$ so that we have $$f(R|X;\psi)$$. Little and Rubin (2002) defined three main categories of missing values depending on the form of the conditional distribution $$f$$:

• Missing completely at random (MCAR): The missingness does not depend on the variables $$X$$, i.e.

$f(R|X;\psi) = f(R;\psi).$

• Missing at random (MAR): The missingness depends only on the observed variables $$X_{obs}$$, i.e.

$f(R|X;\psi) = f(R|X_{obs};\psi),$

or alternatively $$f(R|X^1;\psi) = f(R|X^2;\psi)$$ for all $$X^1 = (X^1_{obs},X^1_{mis})$$ and $$X^2 = (X^2_{obs},X^2_{mis})$$ such that $$X^1_{obs} = X^2_{obs}$$.

• Missing not at random (MNAR): The missingness depends on the missing values themselves, and possibly also on the observed ones, i.e.

$f(R|X;\psi) \neq f(R|X_{obs};\psi).$

To understand this definition, take the example of alcohol consumption: alcoholics are less inclined to reveal their alcohol consumption, so the probability that the information on alcohol consumption is missing depends on the amount consumed. Another simple example is information on income or wealth, which is missing more often for individuals with very high or very low income.
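The three mechanisms can be illustrated with a small simulation. The sketch below is only one possible setup and is not taken from any of the references: it assumes two Gaussian variables, a fully observed `x1` and a partially observed `x2`, and a logistic response model whose coefficients are arbitrary. The helper names (`simulate`, `observed_mean`, `regression_adjusted_mean`) are hypothetical. It compares the mean of `x2` over the observed cases with a regression-adjusted estimate that predicts `x2` from `x1`.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mean(values):
    return sum(values) / len(values)


def simulate(mechanism, n=20000, seed=0):
    """Return (x1, x2, r2); x1 is always observed, r2[i] = 1 if x2[i] is observed."""
    rng = random.Random(seed)
    x1, x2, r2 = [], [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        b = 0.5 * a + rng.gauss(0.0, 1.0)      # x2 is correlated with x1
        if mechanism == "MCAR":
            p_obs = 0.7                        # constant: independent of the data
        elif mechanism == "MAR":
            p_obs = sigmoid(1.0 - 2.0 * a)     # depends only on the observed x1
        elif mechanism == "MNAR":
            p_obs = sigmoid(1.0 - 2.0 * b)     # depends on x2, which may be missing
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        x1.append(a)
        x2.append(b)
        r2.append(1 if rng.random() < p_obs else 0)
    return x1, x2, r2


def observed_mean(x2, r2):
    """Mean of x2 over the cases where it is observed."""
    return mean([b for b, r in zip(x2, r2) if r == 1])


def regression_adjusted_mean(x1, x2, r2):
    """Fit x2 ~ x1 by least squares on complete cases, then average predictions over all x1."""
    pairs = [(a, b) for a, b, r in zip(x1, x2, r2) if r == 1]
    ma = mean([a for a, _ in pairs])
    mb = mean([b for _, b in pairs])
    slope = (sum((a - ma) * (b - mb) for a, b in pairs)
             / sum((a - ma) ** 2 for a, _ in pairs))
    intercept = mb - slope * ma
    return mean([intercept + slope * a for a in x1])


for mechanism in ("MCAR", "MAR", "MNAR"):
    x1, x2, r2 = simulate(mechanism)
    print(mechanism,
          "full-data mean: %.3f" % mean(x2),
          "observed mean: %.3f" % observed_mean(x2, r2),
          "regression-adjusted: %.3f" % regression_adjusted_mean(x1, x2, r2))
```

In this setup the observed mean should be close to the full-data mean only under MCAR. Under MAR the observed mean is biased, but the adjustment through the observed `x1` recovers the full-data mean. Under MNAR both estimates remain biased, because the missingness depends on information that is not available.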

Note that MCAR is a special case of MAR and that these three categories are of increasing complexity, with a large gap between the second and the third. Indeed, most of the generic methods proposed in the last few decades are suited to MAR data. The MNAR case requires different techniques and further assumptions.
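A short derivation may help to see where the gap between MAR and MNAR comes from. It relies on two assumptions that are not part of the notation introduced above: the data follow a density $$f(X;\theta)$$, and the parameters $$\theta$$ and $$\psi$$ are distinct. The likelihood of what is actually observed, $$(X_{obs},R)$$, is obtained by integrating out the missing part:

$f(X_{obs},R;\theta,\psi) = \int f(X_{obs},X_{mis};\theta)\, f(R|X_{obs},X_{mis};\psi)\, dX_{mis}.$

Under MAR, $$f(R|X_{obs},X_{mis};\psi) = f(R|X_{obs};\psi)$$ does not depend on $$X_{mis}$$ and can be taken out of the integral, so that

$f(X_{obs},R;\theta,\psi) = f(R|X_{obs};\psi)\, f(X_{obs};\theta).$

Likelihood-based inference on $$\theta$$ can then be based on $$f(X_{obs};\theta)$$ alone, without modeling the response mechanism. Under MNAR this factorization is not available in general, which is why a model for $$f(R|X;\psi)$$ and additional assumptions are needed.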

Note that Little and Rubin (2002) regard values in these three categories as genuinely missing, as opposed to values that are not really missing: for categorical data, the missingness can instead constitute an additional category (for instance, in a multiple-choice questionnaire, a participant may skip a question because the answer they want to give is not among the offered choices).

• Wainer, H., ed. Drawing Inferences from Self-Selected Samples. New York, NY, USA: Springer, 1986.
• Albert, P. S. and D. A. Follmann. Modeling repeated count data subject to informative dropout. In: Biometrics 56.3 (2000), pp. 667-677.
• Diggle, P. and M. G. Kenward. Informative drop-out in longitudinal data analysis. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 43.1 (1994), pp. 49-93.
• Fang, F., J. Zhao and J. Shao. Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values. In: Statistica Sinica 28.4 (2018), pp. 1677–1701.
• Follmann, D. and M. Wu. An approximate generalized linear model with random effects for informative missing data. In: Biometrics 51.1 (1995), pp. 151-168.
• Gad, A. M. and N. M. M. Darwish. A shared parameter model for longitudinal data with missing values. In: American Journal of Applied Mathematics and Statistics 1.2 (2013), pp. 30-35.
• Heckman, J. J. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of Economic and Social Measurement 5.4 (1976), pp. 475-492.
• Ibrahim, J. G., S. R. Lipsitz and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
• Ibrahim, J. G., M. Chen and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
• Jamshidian, M. and S. Jalal. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. In: Psychometrika 75.4 (2010), pp. 649-674. eprint: NIHMS150003.
• Jamshidian, M., S. Jalal and C. Jansen. MissMech: an R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). In: Journal of Statistical Software 56.6 (2014), pp. 1-31.
• Lee, K. M., R. Mitra and S. Biedermann. Optimal design when outcome values are not missing at random. In: Statistica Sinica 28.4 (2018), pp. 1821–1838.
• Little, R. J. A. Modeling the drop-out mechanism in repeated-measures studies. In: Journal of the American Statistical Association 90.431 (1995), pp. 1112-1121.
• Little, R. J. A. Pattern-mixture models for multivariate incomplete data. In: Journal of the American Statistical Association 88.421 (1993), pp. 125-134.
• Little, R. J. A. A test of missing completely at random for multivariate data with missing values. In: Journal of the American Statistical Association 83.404 (1988), pp. 1198-1202.
• Miao, W. and E. J. Tchetgen Tchetgen. Identification and inference with nonignorable missing covariate data. In: Statistica Sinica 28.4 (2018), pp. 2049–2067.
• Molenberghs, G., B. Michiels, M. G. Kenward, et al. Monotone missing data and pattern-mixture models. In: Statistica Neerlandica 52.2 (1998), pp. 153-161.
• Robins, J. M., A. Rotnitzky and L. P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. In: Journal of the American Statistical Association 90.429 (1995), pp. 106-121.
• Rotnitzky, A., J. M. Robins and D. O. Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. In: Journal of the American Statistical Association 93.444 (1998), pp. 1321-1339.
• Sadinle, M. and J. P. Reiter. Sequential Identification of Nonignorable Missing Data Mechanisms. In: Statistica Sinica 28.4 (2018), pp. 1741–1759.
• Seaman, S., J. Galati, D. Jackson, et al. What Is Meant by “Missing at Random”? In: Statistical Science 28.2 (2013), pp. 257–268.
• Shao, J. and J. Zhang. A transformation approach in linear mixed-effects models with informative missing responses. In: Biometrika 102.1 (2015), pp. 107-119.
• Simon, G. A. and J. S. Simonoff. Diagnostic plots for missing data in least squares regression. In: Journal of the American Statistical Association 81.394 (1986), pp. 501-509.
• Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
• Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
• Tchetgen Tchetgen, E. J., L. Wang and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
• Templ, M., A. Alfons and P. Filzmoser. Exploring Incomplete data using visualization techniques. In: Advances in Data Analysis and Classification 6.1 (2012), pp. 29-47.
• Thijs, H., G. Molenberghs, B. Michiels, et al. Strategies to fit pattern-mixture models. In: Biostatistics 3.2 (2002), pp. 245-265.
• Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
• Verbeke, G., G. Molenberghs, H. Thijs, et al. Sensitivity analysis for nonrandom dropout: a local influence approach. In: Biometrics 57.1 (2001), pp. 7-14.
• Vansteelandt, S., A. Rotnitzky and J. Robins. Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. In: Biometrika 94.4 (2007), pp. 841–860.
• White, I. R., J. Carpenter and N. J. Horton. A mean score method for sensitivity analysis to departures from the missing at random assumption in randomised trials. In: Statistica Sinica 28.4 (2018), pp. 1985–2003.
• Wu, M. C. and R. J. Carroll. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. In: Biometrics 44.1 (1988), pp. 175-188.
• Zhou, Y., R. J. A. Little and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.
• Tierney, N. and D. Cook. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Monash Econometrics and Business Statistics Working Papers 14/18. Monash University, Department of Econometrics and Business Statistics, 2018.