R-miss-tastic

A resource website on missing values - Methods and references for managing missing data

On this platform we attempt to give you an overview of the main references on missing values. We do not claim to gather all available references on the subject but rather to offer a peek into different fields of active research on handling missing values, allowing for an introductory reading as well as a starting point for further bibliographical research.

See here for a full (and uncommented) list of references.

Inspired by the CRAN Task View on Missing Data and a review by Imbert & Vialaneix on handling missing values (2018, written in French), we organized our selection of relevant references on missing values by topic.

In order to provide a more formal introduction to the problem of missing values and the existing methods to handle them (e.g. diagnose/describe the missingness or perform statistical analysis on the incomplete data), we introduce some fairly standard definitions and notations used in the remainder of this article.

  • Let \(X=(X_1,\dots, X_p)\) be a vector of \(p\) random variables which can be continuous or categorical.

  • We denote by \(x_{ij}\) the observation of variable \(X_j\) for individual \(i\in\{1,\dots,n\}\) and by \(\mathbf{x}_i=(x_{i1},\dots,x_{ip})\) the vector of observations of all \(p\) variables for individual \(i\).

  • The observations of the \(n\) individuals are stacked by rows in a matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\).

  • The indicator matrix of missing values \(\mathbf{R}\) is defined such that its values \((r_{ij})_{\substack{i=1,\dots,n\\j=1,\dots,p}}\) are given by: \(r_{ij} = \left\{\begin{array}{ll}1 & \text{ if } x_{ij} \text{ is observed}\\0 & \text{ otherwise}\end{array}\right. = \mathbb{1}_{\{x_{ij} \text{ is observed}\}}\). The associated random variable is denoted by \(R\).

  • The observed and missing parts of \(X\) are denoted respectively by \(X_{obs}\) and \(X_{mis}\).
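
As a small illustration of this notation, the following Python sketch builds the indicator matrix \(\mathbf{R}\) from a data matrix. It assumes, purely for illustration, that missing entries are encoded as `None` or `NaN`; the names `is_missing` and `missingness_indicator` are hypothetical and not taken from any package.

```python
import math


def is_missing(value):
    """Return True if value encodes a missing entry (None or NaN, by assumption)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missingness_indicator(X):
    """Return R with r_ij = 1 if x_ij is observed and 0 otherwise."""
    return [[0 if is_missing(value) else 1 for value in row] for row in X]


# Toy data matrix with n = 3 individuals and p = 3 variables
# (two continuous variables and one categorical variable).
X = [
    [1.5, None, "a"],
    [float("nan"), 2.0, "b"],
    [0.3, 4.1, None],
]
R = missingness_indicator(X)
for row in R:
    print(row)
```

Each row of `R` is the missingness pattern of one individual; the observed part \(X_{obs}\) corresponds to the entries where `R` equals 1.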


These general references and reviews are helpful to get started with the large field of missing values as they provide an introduction to the main concepts and methods or give an overview of the diversity of topics in statistical analysis related to missing values. They discuss different mechanisms that generated the missing values, necessary conditions for working consistently on the observed values alone and ways to impute, i.e. complete, the missing values to end up with complete datasets allowing the use of standard statistical analysis methods.

  • Allison, P. D. Missing Data. Quantitative Applications in the Social Sciences. Thousand Oaks, CA, USA: Sage Publications, 2001. ISBN: 9780761916727.
    DOI
  • Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
    DOI
  • Enders, C. K. Applied Missing Data Analysis. Guilford Press, 2010, p. 401. ISBN: 9781606236390.
  • Kim, J. K. and J. Shao. Statistical Methods for Handling Incomplete Data. Boca Raton, FL, USA: Chapman and Hall/CRC, 2013. ISBN: 9781482205077.
  • Little, R. J. A. and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002, p. 408. ISBN: 0471183865.
    DOI
  • Molenberghs, G., G. Fitzmaurice, M. G. Kenward, et al. Handbook of Missing Data Methodology. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. New York, NY, USA: Chapman and Hall/CRC, 2014. ISBN: 9781439854624.
  • Molenberghs, G. and M. G. Kenward. Missing Data in Clinical Studies. Chichester, West Sussex, UK: Wiley, 2007. ISBN: 9780470849811.
    DOI
  • O’Kelly, M. and B. Ratitch. Clinical Trials with Missing Data: A Guide for Practitioners. John Wiley & Sons, Ltd, 2014.
    DOI
  • Schafer, J. L. Analysis of Incomplete Multivariate Data. CRC Monographs on Statistics & Applied Probability. Boca Raton, FL, USA: Chapman and Hall/CRC, 1997. ISBN: 0412040611.
  • Buuren, S. van. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman and Hall/CRC, 2018.
    URL
  • Graham, J. W. Missing data analysis: making it work in the real world. In: Annual Review of Psychology 60 (2009), pp. 549-576.
    DOI
  • Kaiser, J. Dealing with missing values in data. In: Journal of Systems Integration 5.1 (2014), pp. 42-51.
    DOI
  • Pigott, T. D. A review of methods for missing data. In: Educational Research and Evaluation 7.4 (2001), pp. 353–383.
    DOI
  • Schafer, J. L. and J. W. Graham. Missing data: our view of the state of the art. In: Psychological Methods 7.2 (2002), pp. 147-177.
    DOI
  • Orchard, T. and M. A. Woodbury. A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. Ed. by L. M. Le Cam, J. Neyman and E. L. Scott. Vol. 1. University of California Press, 1972, pp. 697–715.
    URL

If you are rather new to the subject and wish to start with less formal and more application-based introductions, or if you are looking for general high-level advice on handling missing data, we suggest the following publications:

  • National Research Council, U. The Prevention and Treatment of Missing Data in Clinical Trials. Washington (DC), USA: National Academies Press, 2010. ISBN: 9780309158145.
    DOI
  • Baraldi, A. N. and C. K. Enders. An introduction to modern missing data analysis. In: Journal of School Psychology 48.1 (2010), pp. 5-37.
    DOI
  • Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
    DOI
  • Dong, Y. and C. J. Peng. Principled missing data methods for researchers. In: SpringerPlus 2 (2013), p. 222.
    DOI
  • Horton, N. J. and K. P. Kleinman. Much Ado About Nothing - A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models. In: The American Statistician 61.1 (2007), pp. 79-90.
    DOI
  • Meng, X. L. You want me to analyze data I don’t have? Are you insane? In: Shanghai Archives of Psychiatry 24.5 (2012), pp. 287-301.
  • Peugh, J. L. and C. K. Enders. Missing data in educational research: a review of reporting practices and suggestions for improvement. In: Review of Educational Research 74.4 (2004), pp. 525–556.

Furthermore you can have a look at the following statistical journals which regularly contain recent results related to handling missing data:

The first intuitive and probably most widely applied solution for dealing with missing values in data analysis is to delete the partial observations and to work exclusively on the individuals with complete information. This has several drawbacks; among others, it introduces an estimation bias in most cases (more precisely, in cases where the missingness is not independent of the data). In order to reduce this bias one can reweight the complete observations to compensate for the deletion of incomplete individuals in the dataset. The weights are defined by inverse probabilities, for instance the inverse of the probability for each individual of being fully observed. This method is known as inverse probability weighting and is described in detail in the publications below. We split the references into two parts: handling missing values in survey data and performing causal inference in the presence of missing values, both requiring the use of weighting methods.
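
The reweighting idea can be sketched in a few lines of Python. This is a minimal illustration under a strong assumption that is not part of the text above: the probability of being observed depends only on a fully observed group label and is estimated by the observed fraction within each group. The function names and the toy data are hypothetical.

```python
def complete_case_mean(records):
    """Mean of y over the individuals whose y is observed (listwise deletion)."""
    observed = [y for _, y in records if y is not None]
    return sum(observed) / len(observed)


def ipw_mean(records):
    """Inverse-probability-weighted mean of y from (group, y) pairs.

    Assumption: P(y observed) depends only on the fully observed group label
    and is estimated by the fraction of observed y values in that group.
    """
    group_size, group_observed = {}, {}
    for group, y in records:
        group_size[group] = group_size.get(group, 0) + 1
        if y is not None:
            group_observed[group] = group_observed.get(group, 0) + 1

    weighted_sum = 0.0
    weight_total = 0.0
    for group, y in records:
        if y is None:
            continue  # incomplete individuals are dropped, then compensated for
        prob_observed = group_observed[group] / group_size[group]
        weight = 1.0 / prob_observed
        weighted_sum += weight * y
        weight_total += weight
    return weighted_sum / weight_total


# Group "A" is always observed, group "B" is observed one time out of four.
records = [("A", 1.0)] * 4 + [("B", 5.0), ("B", None), ("B", None), ("B", None)]
print(complete_case_mean(records))
print(ipw_mean(records))
```

In this toy example the complete-case mean is pulled towards group "A", while the weighted mean gives the single observed individual of group "B" a weight of 4 so that both groups count equally, as they do in the full sample. A group without any observed value cannot be recovered this way.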


For survey data analysis

Such weighting methods are widely used on survey data in order to correct for unbalanced sampling fractions by balancing the empirical distributions of the observed covariates to recover the structure of the target population.

  • Buck, S. F. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. In: Journal of the Royal Statistical Society, Series B 22 (1960), pp. 302-306.
    DOI
  • Carpenter, J. R., M. G. Kenward, and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
    DOI
  • Fitzmaurice, G. M., G. Molenberghs, and S. R. Lipsitz. Regression Models for Longitudinal Binary Responses with Informative Drop-Outs. In: Journal of the Royal Statistical Society. Series B (Methodological) 57.4 (1995), pp. 691–704.
    URL
  • Gelman, A., G. King, and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
    DOI
  • Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
    URL
  • Preisser, J. S., K. K. Lohman, and P. J. Rathouz. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. In: Statistics in Medicine 21.20 (2002), pp. 3035–3054.
    DOI
  • Robins, J. M., A. Rotnitzky, and L. P. Zhao. Estimation of Regression Coefficients When Some Regressors are not Always Observed. In: Journal of the American Statistical Association 89.427 (1994), pp. 846-866.
    DOI
  • Rubin, D. B. Formalizing subjective notions about the effect of nonrespondents in sample surveys. In: Journal of the American Statistical Association 72.359 (1977), pp. 538-543.
    DOI
  • Vansteelandt, S., J. Carpenter, and M. G. Kenward. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. In: Methodology – European Journal of Research Methods for the Behavioral and Social Sciences 6.1 (2010), pp. 37–48.
    DOI

Methods in common with causal inference

Inverse probability weighting is also used in causal inference: a bias is induced by the presence of confounders, i.e. variables which interact with both covariates and outcome. Hence, if the goal is to estimate causal relationships between covariates and outcome, it is necessary to account for the potential effect of confounders (a selection bias) on the result of causal inference.

  • Bang, H. and J. M. Robins. Doubly robust estimation in missing data and causal inference models. In: Biometrics 61.4 (2005), pp. 962-973.
    DOI
  • Bartlett, J. W., O. Harel, and J. R. Carpenter. Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. In: American Journal of Epidemiology 182.8 (2015), pp. 730–736.
    DOI
  • Blake, H. A., C. Leyrat, K. Mansfield, et al. Propensity scores using missingness pattern information: a practical guide. In: arXiv preprint (2019). arXiv: 1901.03981 [stat.ME].
    URL
  • Ding, P. and F. Li. Causal Inference: A Missing Data Perspective. In: Statistical Science 33.2 (2018), pp. 214–237.
    DOI
  • Hogan, J. W. and T. Lancaster. Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. In: Statistical Methods in Medical Research 13.1 (2004), pp. 17-48.
    DOI
  • Seaman, S. R. and S. Vansteelandt. Introduction to Double Robust Methods for Incomplete Data. In: Statistical Science 33.2 (2018), p. 184.
    DOI
  • Seaman, S. R. and I. R. White. Review of inverse probability weighting for dealing with missing data. In: Statistical Methods in Medical Research 22.3 (2011), pp. 278-295.
    DOI
  • Wal, W. M. van der and R. B. Geskus. ipw: an R package for inverse probability weighting. In: Journal of Statistical Software 43.13 (2011).
    DOI
  • Yang, S., L. Wang, and P. Ding. Identification and estimation of causal effects with confounders subject to instrumental missingness. In: Statistics Methodology Repository (2017).
    URL
  • Zhu, Z., T. Wang, and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. In: arXiv preprint (2019).
    URL
  • Kallus, N., X. Mao, and M. Udell. Causal Inference with Noisy and Missing Covariates via Matrix Factorization. In: Advances in Neural Information Processing Systems. 2018. eprint: 1806.00811.
    URL


Let \(x_i\) be an observation with missing values, e.g. each entry of \(x_i\) could be the temperature on a certain day at a given place, and unfortunately the temperature was not measured on some days. An intuitive idea to replace this missing information could be: take other observations \(\{x_j\}_j\) which are similar to \(x_i\) on the observed values and use this information to fill in the gaps. This idea of taking observed values from neighbours or donors based on some similarity measure is implemented in the so-called hot-deck and k-nearest-neighbors (kNN) approaches.
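
The donor idea can be sketched as follows. This is a simplified kNN imputation written for illustration only, not the algorithm of any particular package: missing entries are assumed to be encoded as `None`, similarity is a root mean squared difference over the coordinates observed in both rows, and the imputed value is the average of the `k` nearest donors. The function name `knn_impute` is hypothetical.

```python
def knn_impute(rows, k=2):
    """Fill each missing entry with the mean of that variable over the k nearest donors."""

    def distance(a, b):
        # Root mean squared difference over the coordinates observed in both rows.
        shared = [(ai - bi) ** 2 for ai, bi in zip(a, b)
                  if ai is not None and bi is not None]
        if not shared:
            return float("inf")
        return (sum(shared) / len(shared)) ** 0.5

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows for which variable j is observed.
            donors = [(distance(row, other), other[j])
                      for h, other in enumerate(rows)
                      if h != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [donor_value for _, donor_value in donors[:k]]
            if nearest:  # leave the gap if no donor is available
                completed[i][j] = sum(nearest) / len(nearest)
    return completed


# Temperatures at four places (rows) on three days (columns).
temperatures = [
    [10, 12, None],
    [10, 12, 14],
    [11, 12, 15],
    [30, 31, 33],
]
completed = knn_impute(temperatures, k=2)
print(completed[0])
```

Here the first place borrows its missing third-day temperature from the two places with the most similar first two days; the distant fourth place is ignored. A hot-deck variant would copy the value of one (possibly randomly chosen) donor instead of averaging.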

  • Andridge, R. and R. J. A. Little. A review of hot deck imputation for survey non-response. In: International Statistical Review 78.1 (2010), pp. 40-64.
    DOI
  • Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
    DOI
  • Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
    DOI
  • Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
    URL
  • Rao, J. N. K. and J. Shao. Jackknife variance estimation with survey data under hot deck imputation. In: Biometrika 79.4 (1992), pp. 811-822.
    DOI
  • Reilly, M. and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. In: Statistics in Medicine 16.1-3 (1997), pp. 5-19.
    DOI
  • Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016). Forthcoming.
    DOI


The most popular way to deal with missing values in statistical inference tasks is to use likelihood-based approaches that can handle incomplete data. More precisely, if the missingness mechanism is ignorable (in a certain sense that is explained in the Missing values mechanisms section) then one can attempt to infer the model parameters by maximizing the likelihood on the observed values. When the mechanism cannot be ignored, a specific model for it needs to be assumed. The main algorithm available for performing maximum likelihood (ML) estimation with missing values is the Expectation Maximization (EM) algorithm. This algorithm requires knowledge of the joint distribution of \(X = (X_{obs}, X_{mis})\) and its implementation is not straightforward since it involves integrals which cannot always be computed easily. Once the model parameters are estimated, one can impute the missing values using this estimated information on the data model.
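
To make the two steps of EM concrete, here is a minimal sketch for one of the few cases where the integrals are available in closed form. The setting is an assumption chosen for illustration: a bivariate Gaussian \((x, y)\) where \(x\) is always observed, \(y\) is sometimes missing (encoded as `None`) and the mechanism is ignorable. The function name `em_bivariate_normal` is hypothetical.

```python
def em_bivariate_normal(x, y, n_iter=200):
    """EM estimates of the mean and covariance of (x, y) when some y are missing."""
    n = len(x)
    # x is fully observed, so its mean and variance are estimated directly.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Initialise the remaining parameters from the observed y values.
    y_obs = [yi for yi in y if yi is not None]
    mu_y = sum(y_obs) / len(y_obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in y_obs) / len(y_obs)
    s_xy = 0.0

    for _ in range(n_iter):
        # E-step: expected sufficient statistics given x and current parameters.
        beta = s_xy / s_xx
        residual_var = s_yy - beta * s_xy
        t_y = t_yy = t_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                e_y = mu_y + beta * (xi - mu_x)   # E[y | x]
                e_yy = e_y ** 2 + residual_var    # E[y^2 | x]
            else:
                e_y, e_yy = yi, yi ** 2
            t_y += e_y
            t_yy += e_yy
            t_xy += xi * e_y
        # M-step: update the parameters from the expected statistics.
        mu_y = t_y / n
        s_yy = t_yy / n - mu_y ** 2
        s_xy = t_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, None, None]
mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(x, y)
print(mu_y)
```

In this toy example the mean of the observed \(y\) values is 5.0, whereas the EM estimate of the mean of \(y\) is larger because the individuals with missing \(y\) have large \(x\). Once the parameters are estimated, the conditional expectation used in the E-step can serve to impute the missing \(y\) values.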

There also exist other methods that allow for statistical inference with missing values without relying on likelihood maximization.

  • McLachlan, G. J. and T. Krishnan. The EM Algorithm and Extensions. Wiley series in probability and statistics. Hoboken, NJ, USA: Wiley, 2008. ISBN: 9780471201700.
  • Collins, L. M., J. L. Schafer, and C. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
    DOI
  • Enders, C. K. A primer on maximum likelihood algorithms available for use with missing data. In: Structural Equation Modeling 8.1 (2001), pp. 128-141.
    DOI
  • Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
    DOI
  • Golden, R. M., S. S. Henley, H. White, et al. Consequences of model misspecification for maximum likelihood estimation with missing data. In: Econometrics 7.3 (2019), p. 37.
    DOI
  • Ibrahim, J. G., M. Chen, and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
    DOI
  • Ibrahim, J. G., S. R. Lipsitz, and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
  • Jiang, W., J. Josse, and M. Lavielle. Logistic Regression with Missing Covariates–Parameter Estimation, Model Selection and Prediction. In: arXiv preprint (2018). arXiv: 1805.04602 [stat.ME].
  • Jones, M. P. Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. In: Journal of the American Statistical Association 91.433 (1996), pp. 222-230.
    DOI
  • Little, R. J. A. Regression with missing X’s: a review. In: Journal of the American Statistical Association 87.420 (1992), pp. 1227-1237.
    DOI
  • Louis, T. A. Finding the Observed Information Matrix when Using the EM Algorithm. In: Journal of the Royal Statistical Society. Series B (Methodological) 44.2 (1982), pp. 226–233.
    URL
  • Lüdtke, O., A. Robitzsch, and S. G. West. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. In: Psychological Methods (2019).
    DOI
  • Meng, X. L. and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: a general framework. In: Biometrika 80.2 (1993), pp. 267-278.
    DOI
  • Meng, X. L. and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. In: Journal of the American Statistical Association 86.416 (1991), pp. 899-909.
    DOI
  • Rosseel, Y. lavaan: an R package for structural equation modeling. In: Journal of Statistical Software 48.2 (2012).
    DOI
  • Rubin, D. B. Inference and missing data. In: Biometrika 63.3 (1976), pp. 581-592.
    DOI
  • Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
    DOI
  • Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
    URL
  • Tabouy, T., P. Barbillon, and J. Chiquet. Variational inference for stochastic block models from sampled data. In: Journal of the American Statistical Association 115.529 (2020), pp. 455–466.
    DOI
  • Tchetgen Tchetgen, E. J., L. Wang, and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
    DOI
  • Xue, F. and A. Qu. Integrating multi-source block-wise missing data in model selection. In: Journal of the American Statistical Association (2020), pp. 1–36.
    DOI
  • Zhao, Y. Statistical inference for missing data mechanisms. In: Statistics in Medicine 39.28 (2020), pp. 4325–4333.
    DOI
  • Zhao, J. and Y. Ma. A versatile estimation procedure without estimating the nonignorable missingness mechanism. In: Journal of the American Statistical Association (2021), pp. 1–15.
    DOI
  • Zhou, Y., R. J. A. Little, and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.
    DOI
  • Londschien, M., S. Kovács, and P. Bühlmann. Change point detection for graphical models in presence of missing values. 2019. arXiv: 1907.05409 [stat.ML].


Regression

There is a vast literature on how to perform (linear) regression, possibly in a high-dimensional setting, in the presence of missing values in the covariates. This can be seen as a particular case of supervised learning, which is presented below, even if the focus is often more on estimating parameters or selecting relevant variables.

  • Jiang, W., M. Bogdan, J. Josse, et al. Adaptive Bayesian SLOPE–High-dimensional Model Selection with Missing Values. In: arXiv preprint (2019).
    URL
  • Golden, R. M., S. S. Henley, H. White, et al. Consequences of model misspecification for maximum likelihood estimation with missing data. In: Econometrics 7.3 (2019), p. 37.
    DOI
  • Jiang, W., J. Josse, and M. Lavielle. Logistic Regression with Missing Covariates–Parameter Estimation, Model Selection and Prediction. In: arXiv preprint (2018). arXiv: 1805.04602 [stat.ME].
  • Jones, M. P. Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. In: Journal of the American Statistical Association 91.433 (1996), pp. 222-230.
    DOI
  • Lüdtke, O., A. Robitzsch, and S. G. West. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. In: Psychological Methods (2019).
    DOI
  • Loh, P. and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In: Advances in Neural Information Processing Systems. Ed. by J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira and K. Q. Weinberger. Vol. 24. Curran Associates, Inc., 2011, pp. 2726–2734.
    URL


In the previously mentioned EM algorithm there is in fact an implicit step called imputation: imputing a missing value means replacing it with a plausible one. The definition of plausibility is not stated explicitly but can be deduced from the method used to fill in the gaps; for instance, one could choose to replace all missing values of a certain variable \(X_j\) by the average observed value \(\frac{1}{n_{obs,j}}\sum_{i} x_{ij}\mathbb{1}_{\{x_{ij} \text{ is observed}\}}\), where \(n_{obs,j} = \sum_{i} \mathbb{1}_{\{x_{ij} \text{ is observed}\}}\). The interest of imputation is manifold: (1) it allows the use of all information in the sample (instead of deleting incomplete observations, which decreases the power of the statistical analysis), (2) if there is sufficient data, i.e. sufficient observations, then the imputation can be very accurate, which supports the quality of subsequent statistical analyses, and (3) the imputed dataset is a complete dataset and one can apply standard statistical inference methods. The latter however has to be treated with caution since it implies that in the statistical analysis one no longer makes any distinction between observed values and imputed values. We will come back to this issue in the next section on multiple imputation.
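
The mean imputation formula above translates directly into code. The sketch below assumes numeric variables with missing entries encoded as `None`; the function name `mean_impute` is hypothetical.

```python
def mean_impute(X):
    """Replace each missing entry of variable j by the mean of its observed values."""
    p = len(X[0])
    column_means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        # n_obs_j = len(observed); a variable with no observed value keeps its gaps.
        column_means.append(sum(observed) / len(observed) if observed else None)
    return [[column_means[j] if row[j] is None else row[j] for j in range(p)]
            for row in X]


X = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]
X_imputed = mean_impute(X)
for row in X_imputed:
    print(row)
```

All imputed values of a variable are identical, so the completed variable has a smaller empirical variance than the observed values alone, which illustrates the caution expressed above about treating imputed values like observed ones.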

  • Audigier, V., F. Husson, and J. Josse. A principal component method to impute missing values for mixed data. In: Advances in Data Analysis and Classification 10.1 (2016), pp. 5-26.
    DOI
  • Bertsimas, D., C. Pawlowski, and Y. D. Zhuo. From predictive methods to missing data imputation: an optimization approach. In: The Journal of Machine Learning Research 18.1 (2017), pp. 7133–7171.
  • Cranmer, S. J. and J. Gill. We have to be discrete about this: a non-parametric imputation technique for missing categorical data. In: British Journal of Political Science 43 (2012), pp. 425-449.
    DOI
  • Crookston, N. L. and A. O. Finley. yaImpute: an R package for kNN imputation. In: Journal of Statistical Software 23 (2008), p. 10.
    DOI
  • Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
    DOI
  • Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
    URL
  • Fellegi, I. P. and D. Holt. A systematic approach to automatic edit and imputation. In: Journal of the American Statistical Association 71.353 (1976), pp. 17-35.
    DOI
  • Ferrari, P. A., P. Annoni, A. Barbiero, et al. An imputation method for categorical variables with application to nonlinear principal component analysis. In: Computational Statistics & Data Analysis 55.7 (2011), pp. 2410-2420.
    DOI
  • Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
    DOI
  • Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
    DOI
  • Husson, F. and J. Josse. Handling missing values in multiple factor analysis. In: Food Quality and Preference 30 (2013), pp. 77-85.
    DOI
  • Ilin, A. and T. Raiko. Practical approaches to Principal Component Analysis in the presence of missing values. In: Journal of Machine Learning Research 11 (2010), pp. 1957-2000.
    URL
  • Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
    URL
  • Josse, J., M. Chavent, B. Liquet, et al. Handling missing values with regularized iterative multiple correspondence analysis. In: Journal of Classification 29.1 (2012), pp. 91-116.
    DOI
  • Josse, J., F. Husson, and J. Pagès. Gestion des données manquantes en Analyse en Composantes Principales. In: Journal de la Société Française de Statistique 150.2 (2009), pp. 28-51.
    URL
  • Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
    URL
  • Kohn, R. and C. F. Ansley. Estimation, prediction, and interpolation for ARIMA models with missing data. In: Journal of the American Statistical Association 81.395 (1986), pp. 751-761.
    DOI
  • Kowarik, A. and M. Templ. Imputation with the R Package VIM. In: Journal of Statistical Software 74.7 (2016), pp. 1-16.
    DOI
  • Moritz, S. and T. Bartz-Beielstein. imputeTS: time series missing value imputation in R. In: The R Journal 9.1 (2017), pp. 207-218.
    URL
  • Tang, F. and H. Ishwaran. Random forest missing data algorithms. In: Statistical Analysis and Data Mining: The ASA Data Science Journal 10.6 (2017), pp. 363–377.
    DOI
  • Stacklies, W., H. Redestig, M. Scholz, et al. pcaMethods – a bioconductor package providing PCA methods for incomplete data. In: Bioinformatics 23.9 (2007), pp. 1164-1167.
    DOI
  • Troyanskaya, O., M. Cantor, G. Sherlock, et al. Missing value estimation methods for DNA microarrays. In: Bioinformatics 17.6 (2001), pp. 520-525.
    DOI
  • Unnebrink, K. and J. Windeler. Intention-to-treat: methods for dealing with missing values in clinical trials of progressively deteriorating diseases. In: Statistics in Medicine 20.24 (2001), pp. 3931-3946.
    DOI
  • Verbanck, M., J. Josse, and F. Husson. Regularised PCA to denoise and visualise data. In: Statistics and Computing 25.2 (2015), pp. 471-486.
    DOI
  • Zhang, H., P. Xie, and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
    URL
  • Zhang, S. Nearest neighbor selection for iterative kNN imputation. In: Journal of Systems and Software 85.11 (2012), pp. 2541-2552.
    DOI
  • Zhu, Z., T. Wang, and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. In: arXiv preprint (2019).
    URL
  • Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
    DOI
  • Zhao, Y. and M. Udell. Missing value imputation for mixed data via gaussian copula. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, pp. 636–646.
    DOI
  • Moritz, S., A. Sardá, T. Bartz-Beielstein, et al. Comparison of different methods for univariate time series imputation in R. Preprint arXiv 1510.03924. 2015.
    URL


Matrix factorization

A special case of imputation is matrix completion that exploits structural assumptions about the row and column spaces to impute the missing values.
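
As an illustration of such a structural assumption, the sketch below completes a matrix under the simplest possible low-rank model, rank one, by alternating least squares on the observed entries only. It is a toy example written for this page: practical methods use higher ranks and regularization, and the function name `rank_one_complete` is hypothetical. Missing entries are assumed to be encoded as `None`.

```python
def rank_one_complete(X, n_iter=200):
    """Fit X ~ u v^T on the observed entries and fill the gaps with u_i * v_j."""
    n, p = len(X), len(X[0])
    u = [1.0] * n
    v = [1.0] * p
    for _ in range(n_iter):
        # Update each row factor by least squares over its observed entries.
        for i in range(n):
            cols = [j for j in range(p) if X[i][j] is not None]
            denom = sum(v[j] ** 2 for j in cols)
            if denom > 0:
                u[i] = sum(X[i][j] * v[j] for j in cols) / denom
        # Update each column factor by least squares over its observed entries.
        for j in range(p):
            rows = [i for i in range(n) if X[i][j] is not None]
            denom = sum(u[i] ** 2 for i in rows)
            if denom > 0:
                v[j] = sum(X[i][j] * u[i] for i in rows) / denom
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j] for j in range(p)]
            for i in range(n)]


# Outer product of (1, 2, 3) and (1, 2, 4) with one entry removed.
X = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 12.0],
]
completed = rank_one_complete(X)
print(completed[1][2])
```

Because the toy matrix is exactly of rank one, the missing entry is determined by the observed ones: the second row is twice the first, so the imputed value approaches 8.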

  • Nguyen, L. T., J. Kim, and B. Shim. Low-Rank Matrix Completion: A Contemporary Survey. In: IEEE Access 7 (2019), pp. 94215–94237.
    DOI
  • Robin, G., O. Klopp, J. Josse, et al. Main Effects and Interactions in Mixed and Incomplete Data Frames. In: Journal of the American Statistical Association 115.531 (2020), pp. 1292-1303. eprint: https://doi.org/10.1080/01621459.2019.1623041.
  • Sportisse, A., C. Boyer, and J. Josse. Imputation and low-rank estimation with Missing Not At Random data. In: Statistics and Computing 30.6 (2020), pp. 1629-1643.
    DOI
  • Kallus, N., X. Mao, and M. Udell. Causal Inference with Noisy and Missing Covariates via Matrix Factorization. In: Advances in Neural Information Processing Systems. 2018. eprint: 1806.00811.
    URL
  • Ma, W. and G. H. Chen. Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox and R. Garnett. Curran Associates, Inc., 2019, pp. 14900–14909.
    URL
  • Zhao, Y. and M. Udell. Missing value imputation for mixed data via gaussian copula. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, pp. 636–646.
    DOI
  • Robin, G. Low-rank methods for heterogeneous and multi-source data. 2019.
    DOI


A major drawback of single imputation, i.e. where every missing value is replaced by a single most plausible value, is the underestimation of the overall variance of the data and of the inferred parameters. Indeed, by replacing every missing value by a given plausible one and by applying generic statistical methods on the completed dataset, one no longer distinguishes between initially observed and unobserved data. Therefore the variability due to the uncertainty of the missing values is not reflected in subsequent statistical analyses, which treat the dataset as if it had been fully observed from the beginning. A conceptually simple workaround for this problem is multiple imputation: instead of generating a single complete dataset by a given imputation method, one imputes every missing value by several possible values. Statistical analysis is then applied to each of the imputed datasets and the resulting estimates are aggregated and used to estimate the sample variance and the variance due to the uncertainty in the missing values.
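
The aggregation step is commonly done with the combining rules known as Rubin's rules. The sketch below applies them to a scalar parameter, assuming that each of the \(m\) imputed datasets has already produced an estimate and its estimated variance; the function name `pool_rubin` and the numbers are hypothetical.

```python
def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m
    # Within-imputation variance: average of the completed-data variances.
    within = sum(variances) / m
    # Between-imputation variance: variability due to the missing values.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    # Total variance of the pooled estimate.
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total


# Estimates and variances obtained on m = 3 imputed datasets (made-up numbers).
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, within, between, total = pool_rubin(estimates, variances)
print(pooled, within, between, total)
```

The between-imputation term is exactly what single imputation ignores: with a single completed dataset it cannot be estimated, so the reported variance reduces to the within-imputation part.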

  • Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
    DOI
  • Rubin, D. B. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ, USA: Wiley, 1987. ISBN: 9780471655740.
  • Abayomi, K., A. Gelman, and M. Levy. Diagnostics for multivariate imputations. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 57.3 (2008), pp. 273-291.
    DOI
  • Audigier, V., F. Husson, and J. Josse. Multiple imputation for continuous variables using a Bayesian principal component analysis. In: Journal of Statistical Computation and Simulation 86.11 (2015), pp. 2140-2156.
    DOI
  • Audigier, V., F. Husson, and J. Josse. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. In: Statistics and Computing 27.2 (2016), pp. 1-18. eprint: 1505.08116.
    DOI
  • Carpenter, J. R., M. G. Kenward, and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
    DOI
  • Collins, L. M., J. L. Schafer, and C. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
    DOI
  • Erler, N. S., D. Rizopoulos, and E. M. Lesaffre. JointAI: joint analysis and imputation of incomplete data in R. In: arXiv preprint (2019).
    URL
  • Fay, R. E. Alternative paradigms for the analysis of imputed survey data. In: Journal of the American Statistical Association 91.434 (1996), pp. 490-498.
    DOI
  • Gelman, A., G. King, and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
    DOI
  • Gelman, A., I. van Mechelen, G. Verbeke, et al. Multiple Imputation for Model Checking: Completed-Data Plots with Missing and Latent Data. In: Biometrics 61.1 (2005), pp. 74–85.
    DOI
  • Graham, J. W., A. E. Olchowski, and T. E. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. In: Prevention Science 8.3 (2007), pp. 206-213.
    DOI
  • Honaker, J., G. King, and M. Blackwell. Amelia II: a program for missing data. In: Journal of Statistical Software 45.7 (2011). eprint: arXiv:1501.0228.
    DOI
  • Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
    DOI
  • Josse, J., J. Pagès, and F. Husson. Multiple imputation in principal component analysis. In: Advances in Data Analysis and Classification 5.3 (2011), pp. 231-246.
    DOI
  • Josse, J. and F. Husson. Handling missing values in exploratory multivariate data analysis methods. In: Journal de la Société Française de Statistique 153.2 (2012), pp. 79-99.
    URL
  • Josse, J. and F. Husson. missMDA: a package for handling missing values in multivariate data analysis. In: Journal of Statistical Software 70.1 (2016), pp. 1-31.
    DOI
  • Kropko, J., B. Goodrich, A. Gelman, et al. Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches. In: Political Analysis 22.4 (2014), pp. 497–519.
    DOI
  • Larose, C., D. K. Dey, and O. Harel. The impact of missing values on different measures of uncertainty. In: Statistica Sinica 29.2 (2019), pp. 551–566.
    DOI
  • Murray, J. S. and J. P. Reiter. Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence. In: Journal of the American Statistical Association 111.516 (2016), pp. 1466-1479.
    DOI
  • Quartagno, M. and J. R. Carpenter. Multiple imputation for discrete data: Evaluation of the joint latent normal model. In: Biometrical Journal 61.4 (2019), pp. 1003–1019.
    DOI
  • Robins, J. M. and N. Wang. Inference for imputation estimators. In: Biometrika 87.1 (2000), pp. 113-124.
    URL
  • Rubin, D. B. Multiple imputation after 18+ years. In: Journal of the American Statistical Association 91.434 (1996), pp. 473-489.
    DOI
  • Schafer, J. L. and M. K. Olsen. Multiple Imputation for multivariate missing-data problems: a data analyst’s perspective. In: Multivariate Behavioral Research 33.4 (1998), pp. 545-571.
    DOI
  • Schafer, J. L. Multiple imputation: a primer. In: Statistical Methods in Medical Research 8.1 (1999), pp. 3-15.
    DOI
  • Stuart, E. A., M. Azur, C. Frangakis, et al. Multiple imputation with large data sets: a case study of the children’s mental health initiative. In: American Journal of Epidemiology 169.9 (2009), pp. 1133-1139.
    DOI
  • Su, Y. S., A. Gelman, J. Hill, et al. Multiple imputation with diagnostics (mi) in R: opening windows into the black box. In: Journal of Statistical Software 45 (2011), p. 2.
    DOI
  • Buuren, S. van, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, et al. Fully conditional specification in multivariate imputation. In: Journal of Statistical Computation and Simulation 76.12 (2006), pp. 1049-1064.
    DOI
  • Buuren, S. van and K. Groothuis-Oudshoorn. MICE: multivariate imputation by chained equations in R. In: Journal of Statistical Software 45 (2011), p. 3. eprint: NIHMS150003.
    DOI
  • Buuren, S. van. Multiple imputation of discrete and continuous data by fully conditional specification. In: Statistical Methods in Medical Research 16 (2007), pp. 219-242.
    DOI
  • Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016). Forthcoming.
    DOI
  • Wang, N. and J. M. Robins. Large-sample theory for parametric multiple imputation procedures. In: Biometrika 85.4 (1998), pp. 935–948.
    DOI
  • Xie, X. and X. L. Meng. Dissecting multiple imputation from a multi-phase inference perspective: what happens when God’s, imputer’s and analyst’s models are uncongenial? In: Statistica Sinica 27.4 (2017), pp. 1485–1594.
    DOI
  • Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.
  • Muzellec, B., J. Josse, C. Boyer, et al. Missing Data Imputation using Optimal Transport. In: International Conference on Machine Learning. PMLR. 2020, pp. 7130–7140.


Since machine learning depends on the availability of (good) training data, it necessarily faces the issue of missing data in most real-world applications. Hence, increasing attention has been paid to how to handle missing data, in the features and in the output, in order to learn accurately from the data.

Supervised learning

Methods for supervised learning with missing values in the covariates (predicting an outcome as well as possible) differ substantially from methods for inference with missing values (estimating parameters).

  • Ma, A. and D. Needell. Stochastic Gradient Descent for Linear Systems with Missing Data. In: Numerical Mathematics: Theory, Methods and Applications 12.1 (2017), pp. 1-20.
    DOI
  • Ipsen, N. B., P. Mattei, and J. Frellsen. How to deal with missing data in supervised deep learning? In: ICML Workshop on the Art of Learning with Missing Values (Artemiss). 2020.
    URL
  • Le Morvan, M., N. Prost, J. Josse, et al. Linear predictor on linearly-generated data with missing values: non consistency and solutions. In: Proceedings of Machine Learning Research. Vol. 108. 2020, pp. 3165–3174. eprint: 2002.00658v2.
    URL
  • Le Morvan, M., J. Josse, T. Moreau, et al. NeuMiss networks: differentiable programming for supervised learning with missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2007.01627v4.
    URL
  • Sportisse, A., C. Boyer, A. Dieuleveut, et al. Debiasing Averaged Stochastic Gradient Descent to handle missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2002.09338v2.
    URL
  • Le Morvan, M., J. Josse, E. Scornet, et al. What’s a good imputation to predict with missing values? 2021.
    URL


Unsupervised learning

Methods have been suggested to perform clustering with missing values (k-means, mixture models) as well as dimensionality reduction with missing values (PCA).

  • Brinis, S., C. Traina, and A. J. Traina. Hollow-tree: a metric access method for data with missing values. In: Journal of Intelligent Information Systems (2019), pp. 1–28.
    DOI
  • Hunt, L. and M. Jorgensen. Mixture model clustering for mixed data with missing information. In: Computational Statistics & Data Analysis 41.3-4 (2003), pp. 429–440.
    DOI
  • Chi, J. T., E. C. Chi, and R. G. Baraniuk. k-pod: A method for k-means clustering of missing data. In: The American Statistician 70.1 (2016), pp. 91–99.
    DOI
  • Josse, J., M. Chavent, B. Liquet, et al. Handling missing values with regularized iterative multiple correspondence analysis. In: Journal of Classification 29.1 (2012), pp. 91-116.
    DOI
  • Miao, W. and E. J. Tchetgen Tchetgen. Identification and inference with nonignorable missing covariate data. In: Statistica Sinica 28.4 (2018), pp. 2049–2067.
    DOI


Trees and forests

Decision trees are models based on the recursive application of elementary rules. This architecture offers a variety of simple options for dealing with missing values without requiring prior imputation. Ensembles of randomized decision trees, known as random forests, are a popular class of tree-based models; they allow data analyses such as causal inference in the presence of missing values without first imputing them.
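One such option can be sketched in a few lines of Python. The sketch assumes the strategy often called "missing incorporated in attribute" (MIA), in which every candidate split is evaluated twice, once with the missing entries sent to the left child and once with them sent to the right child. The function names (`sse`, `best_mia_split`) and the toy data are hypothetical, and the example covers only a single feature with a squared-error criterion.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_mia_split(x, y):
    """Return (threshold, missing_goes_left, loss) for one feature.

    x may contain None for missing entries; y is fully observed.
    Using the largest observed value as threshold with the missing
    entries sent right yields the "observed versus missing" split.
    """
    thresholds = sorted({v for v in x if v is not None})
    best = (None, None, math.inf)
    for t in thresholds:
        for missing_left in (True, False):
            left, right = [], []
            for xi, yi in zip(x, y):
                go_left = missing_left if xi is None else xi <= t
                (left if go_left else right).append(yi)
            if not left or not right:
                continue  # not a real split
            loss = sse(left) + sse(right)
            if loss < best[2]:
                best = (t, missing_left, loss)
    return best


# Toy data (hypothetical): the rows with a missing x behave like large x.
x = [1, 2, 3, 4, None, None]
y = [0, 0, 1, 1, 1, 1]
threshold, missing_left, loss = best_mia_split(x, y)
print(threshold, missing_left, loss)
```

Because the direction taken by the missing entries is chosen by the splitting criterion itself, no value has to be filled in before the tree is grown.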

  • Beaulac, C. and J. S. Rosenthal. BEST: A decision tree algorithm that handles missing values. In: arXiv preprint (2018). eprint: 1804.10168.
    URL
  • Bertsimas, D., C. Pawlowski, and Y. D. Zhuo. From predictive methods to missing data imputation: an optimization approach. In: The Journal of Machine Learning Research 18.1 (2017), pp. 7133–7171.
  • Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
    URL
  • Hothorn, T., K. Hornik, and A. Zeileis. Unbiased Recursive Partitioning: A Conditional Inference Framework. In: Journal of Computational and Graphical Statistics 15.3 (2006), pp. 651-674.
    DOI
  • Josse, J., N. Prost, E. Scornet, et al. On the consistency of supervised learning with missing values. In: arXiv preprint (2019). arXiv: 1902.06931 [stat.ML].
    URL
  • Kapelner, A. and J. Bleich. Prediction with missing data via Bayesian additive regression trees. In: Canadian Journal of Statistics 43.2 (2015), pp. 224-239.
  • Khosravi, P., A. Vergari, Y. Choi, et al. Handling missing data in decision trees: A probabilistic approach. In: arXiv preprint arXiv:2006.16341 (2020).
  • Rahman, G. and Z. Islam. Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. In: Knowledge-Based Systems 53 (2013), pp. 51–65.
  • Stekhoven, D. J. and P. Bühlmann. Missforest-non-parametric missing value imputation for mixed-type data. In: Bioinformatics 28.1 (2012), pp. 112-118. eprint: 1105.0828.
    DOI
  • Strobl, C., A. L. Boulesteix, and T. Augustin. Unbiased split selection for classification trees based on the Gini Index. In: Computational Statistics & Data Analysis 52.1 (2007), pp. 483-501.
    DOI
  • Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
    DOI
  • Twala, B. E. T. H., M. C. Jones, and D. J. Hand. Good methods for coping with missing data in decision trees. In: Pattern Recognition Letters 29.7 (2008), pp. 950-956.
    DOI
  • Chen, T. and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (Aug. 13, 2016-Aug. 17, 2016). New York, NY, USA: ACM, 2016, pp. 785-794. ISBN: 0450342322.
    DOI
  • Rieger, A., T. Hothorn, and C. Strobl. Random forests with missing values in the covariates. Tech. rep. 79. University of Munich, Department of Statistics, 2010.
    URL


Deep Learning

The advance and success of (deep) neural networks in many research and application areas, such as computer vision and natural language processing, have also brought renewed attention to the problem of handling missing values. Indeed, the question of training neural networks on incomplete data was considered well before the latest rise of deep learning, and it is regarded as essential because missingness affects the feasibility and quality of various learning problems.

  • Bianchi, F. M., L. Livi, K. Ø. Mikalsen, et al. Learning representations of multivariate time series with missing data. In: Pattern Recognition 96 (2019), p. 106973.
    DOI
  • Ipsen, N. B., P. Mattei, and J. Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In: arXiv preprint (2020).
    URL
  • Sharpe, P. K. and R. J. Solly. Dealing with missing values in neural network-based diagnostic systems. In: Neural Computing & Applications 3.2 (1995), pp. 73-77.
    DOI
  • Śmieja, M., Ł. Struski, J. Tabor, et al. Processing of missing data by neural networks. In: Computing Research Repository abs/1805.07405 (2018). eprint: 1805.07405.
    URL
  • Sovilj, D., E. Eirola, Y. Miche, et al. Extreme learning machine for missing data using multiple imputations. In: Neurocomputing 174.A (2016), pp. 220-231.
    DOI
  • Zhang, H., P. Xie, and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
    URL
  • Bengio, Y. and F. Gingras. Recurrent neural networks for missing or asynchronous data. In: Proceedings of the 8th International Conference on Neural Information Processing Systems. (Nov. 27, 1995-Dec. 02, 1995). Cambridge, MA, USA: MIT Press, 1995, pp. 395-401.
    URL
  • Biessmann, F., D. Salinas, S. Schelter, et al. “Deep” Learning for Missing Value Imputation in Tables with Non-Numerical Data. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18. Torino, Italy: ACM, 2018, pp. 2017–2025. ISBN: 978-1-4503-6014-2.
  • Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.
  • Goodfellow, I., M. Mirza, A. Courville, et al. Multi-Prediction Deep Boltzmann Machines. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. (Dec. 05, 2013-Dec. 10, 2013). Ed. by C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger. Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 548–556.
    URL
  • Ipsen, N. B., P. Mattei, and J. Frellsen. How to deal with missing data in supervised deep learning? In: ICML Workshop on the Art of Learning with Missing Values (Artemiss). 2020.
    URL
  • Mattei, P. and J. Frellsen. MIWAE: Deep generative modelling and imputation of incomplete data sets. In: Proceedings of the 36th International Conference on Machine Learning. Vol. 97. Proceedings of Machine Learning Research. Kamalika Chaudhuri and Ruslan Salakhutdinov, 2019, pp. 4413–4423.
    URL
  • Le Morvan, M., J. Josse, T. Moreau, et al. NeuMiss networks: differentiable programming for supervised learning with missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2007.01627v4.
    URL
  • Nowicki, R. K., R. Scherer, and L. Rutkowski. Novel rough neural network for classification with missing data. In: 21st International Conference on Methods and Models in Automation and Robotics (MMAR). (Aug. 29, 2016-Sep. 01, 2016). IEEE, 2016, pp. 820–825.
    DOI
  • Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
    DOI
  • Yoon, J., J. Jordon, and M. van der Schaar. GAIN: Missing Data Imputation using Generative Adversarial Nets. In: Proceedings of the 35th International Conference on Machine Learning. (Jul. 10, 2018-Jul. 15, 2018). Ed. by J. Dy and A. Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, 2018, pp. 5689–5698.
    URL
  • Yoon, S. and S. Sull. GAMIN: Generative Adversarial Multiple Imputation Network for Highly Missing Data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 8456–8464.
    URL
  • Londschien, M., S. Kovács, and P. Bühlmann. Change point detection for graphical models in presence of missing values. 2019. arXiv: 1907.05409 [stat.ML].


As mentioned in the above sections, it is necessary to make assumptions on the mechanism generating the missing values or response mechanism in order to work with missing values. Broadly speaking, these assumptions indicate how much the missingness is related to the data itself. The assumptions made on the mechanism impact further steps in the data analysis (since some types of missingness can induce a bias on the analysis results) and are therefore crucial for valid analyses of data in the presence of missing values.

More formally, both \(X\) and \(R\) are modeled as random variables and the response mechanism is defined as the conditional distribution of \(R\) given \(X\), \(\mathbb{P}_R(R|X)\). This distribution can depend on some parameter \(\phi\) so that we have \(\mathbb{P}_R(R|X;\phi)\). Little and Rubin (2002) defined three main categories of missing values depending on the form of the conditional distribution \(\mathbb{P}_R\):

  • Missing completely at random (MCAR): The missingness does not depend on the variables \(X=(X_{obs},X_{mis})\), where \(X_{obs}\) and \(X_{mis}\) denote the observed and the missing variables respectively, i.e.

    \[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R;\phi), \forall \phi\]

  • Missing at random (MAR): The missingness depends only on the observed variables \(X_{obs}\), i.e.

    \[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R|X_{obs};\phi), \forall \phi,X_{mis}\]

  • Missing not at random (MNAR): The missingness is said to be MNAR in all other cases, i.e. the missingness depends on the missing values and potentially also on the observed values.

To illustrate the MNAR case, take the example of alcohol consumption: alcoholics are less inclined to reveal their alcohol consumption, so the probability that the information on alcohol consumption is missing depends on the amount consumed. Another simple example is information on income or wealth, which is missing more often for individuals with very high or very low income.

Note that MCAR is a special case of MAR and that these three categories are of increasing complexity, with a large gap between the second and the third. Indeed, most of the generic methods proposed in the last few decades are suited to MAR data. The MNAR case requires different techniques and further assumptions.
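The three mechanisms can be made concrete with a small simulation. The following Python sketch is an illustration only: the data-generating model, the logistic response probabilities and the function names (`sigmoid`, `simulate`) are assumptions chosen for the example. A variable `x` is always observed, a variable `y` is subject to missingness, and, as in the indicator matrix defined earlier, a response indicator equal to 1 means that the value is observed.

```python
import math
import random
import statistics


def sigmoid(z):
    """Logistic function, used here as a response probability."""
    return 1 / (1 + math.exp(-z))


def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all y, mean of the observed y) under a mechanism."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)       # always observed
        y = x + rng.gauss(0, 1)   # subject to missingness
        if mechanism == "MCAR":
            p_observed = 0.5              # depends on neither x nor y
        elif mechanism == "MAR":
            p_observed = sigmoid(-2 * x)  # depends on the observed x only
        elif mechanism == "MNAR":
            p_observed = sigmoid(-2 * y)  # depends on the value of y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() < p_observed:     # response indicator equal to 1
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for name in ("MCAR", "MAR", "MNAR"):
    full_mean, observed_mean = simulate(name)
    print(name, "full:", full_mean, "observed only:", observed_mean)
```

In this toy setting the mean of the observed values stays close to the full-data mean under MCAR only. Under MAR the complete-case mean is also shifted, but since the missingness is driven by the observed `x`, it can be corrected with a model of `y` given `x`. Under MNAR the shift depends on the unobserved values themselves, which is why further assumptions are needed.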

Note that Little and Rubin (2002) consider these three categories to be genuinely missing values, as opposed to values that are not really missing: for categorical data, the missingness may rather constitute an additional category (for instance, in a questionnaire with multiple choice answers, a participant can skip a question because the category they want to choose is not among the given choices).

Another, possibly complementary, approach to studying missing-value mechanisms and problems uses graphical models, for instance missingness graphs or m-graphs (Mohan et al., 2013). These make it possible to represent multivariate dependencies and to study identifiability or recoverability for different (estimation or prediction) problems.

Finally, another line of research considers missing values before they occur and addresses the question of how to anticipate or control their occurrence through the study design.

  • Wainer, H., ed. Drawing Inferences from Self-Selected Samples. New York, NY, USA: Springer, 1986.
  • Albert, P. S. and D. A. Follmann. Modeling repeated count data subject to informative dropout. In: Biometrics 56.3 (2000), pp. 667-677.
    DOI
  • Chen, Y. and M. Sadinle. Nonparametric Pattern-Mixture Models for Inference with Missing Data. In: arXiv preprint (2019). arXiv: 1904.11085 [stat.ME].
    URL
  • Diggle, P. and M. G. Kenward. Informative drop-out in longitudinal data analysis. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 43.1 (1994), pp. 49-93.
    DOI
  • Fang, F., J. Zhao, and J. Shao. Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values. In: Statistica Sinica 28.4 (2018), pp. 1677–1701.
    DOI
  • Follmann, D. and M. Wu. An approximate generalized linear model with random effects for informative missing data. In: Biometrics 51.1 (1995), pp. 151-168.
    DOI
  • Gad, A. M. and N. M. M. Darwish. A shared parameter model for longitudinal data with missing values. In: American Journal of Applied Mathematics and Statistics 1.2 (2013), pp. 30-35.
    URL
  • Heckman, J. J. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of Economic and Social Measurement 5.4 (1976), pp. 475-492.
    URL
  • Ibrahim, J. G., M. Chen, and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
    DOI
  • Ibrahim, J. G., S. R. Lipsitz, and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
  • Ipsen, N. B., P. Mattei, and J. Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In: arXiv preprint (2020).
    URL
  • Jamshidian, M., S. Jalal, and C. Jansen. MissMech: an R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). In: Journal of Statistical Software 56.6 (2014), pp. 1-31.
    DOI
  • Jamshidian, M. and S. Jalal. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. In: Psychometrika 75.4 (2010), pp. 649-674. eprint: NIHMS150003.
    DOI
  • Larose, C., D. K. Dey, and O. Harel. The impact of missing values on different measures of uncertainty. In: Statistica Sinica 29.2 (2019), pp. 551–566.
    DOI
  • Lee, K. M., R. Mitra, and S. Biedermann. Optimal design when outcome values are not missing at random. In: Statistica Sinica 28.4 (2018), pp. 1821–1838.
    DOI
  • Lee, K. J., K. Tilling, R. P. Cornish, et al. Framework for the Treatment And Reporting of Missing data in Observational Studies: The Treatment And Reporting of Missing data in Observational Studies framework. In: Journal of clinical epidemiology 134 (2021), pp. 79–88.
  • Little, R. J. A. A test of missing completely at random for multivariate data with missing values. In: Journal of the American Statistical Association 83.404 (1988), pp. 1198-1202.
    DOI
  • Little, R. J. A. Pattern-mixture models for multivariate incomplete data. In: Journal of the American Statistical Association 88.421 (1993), pp. 125-134.
    DOI
  • Little, R. J. A. Modeling the drop-out mechanism in repeated-measures studies. In: Journal of the American Statistical Association 90.431 (1995), pp. 1112-1121.
    DOI
  • Miao, W. and E. J. Tchetgen Tchetgen. Identification and inference with nonignorable missing covariate data. In: Statistica Sinica 28.4 (2018), pp. 2049–2067.
    DOI
  • Molenberghs, G., B. Michiels, M. G. Kenward, et al. Monotone missing data and pattern-mixture models. In: Statistica Neerlandica 52.2 (1998), pp. 153-161.
    DOI
  • Nabi, R., R. Bhattacharya, and I. Shpitser. Full Law Identification In Graphical Models Of Missing Data: Completeness Results. In: arXiv preprint arXiv:2004.04872 (2020).
    URL
  • Reiter, J. P. and M. Sadinle. Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. In: Biometrika 104.1 (Jan. 2017), pp. 207-220.
  • Rioux, C., A. Lewin, O. A. Odejimi, et al. Reflection on modern methods: planned missing data designs for epidemiological research. In: International Journal of Epidemiology (2020).
    DOI
  • Robins, J. M., A. Rotnitzky, and L. P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. In: Journal of the American Statistical Association 90.429 (1995), pp. 106-121.
    DOI
  • Rotnitzky, A., J. M. Robins, and D. O. Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. In: Journal of the American Statistical Association 93.444 (1998), pp. 1321-1339.
    DOI
  • Sadinle, M. and J. P. Reiter. Sequential Identification of Nonignorable Missing Data Mechanisms. In: Statistica Sinica 28.4 (2018), pp. 1741–1759.
    DOI
  • Sadinle, M. and J. P. Reiter. Sequentially additive nonignorable missing data modeling using auxiliary marginal information. In: arXiv preprint (2019). arXiv: 1902.06043 [stat.ME].
    URL
  • Santos, M. S., R. C. Pereira, A. F. Costa, et al. Generating Synthetic Missing Data: A Review by Missing Mechanism. In: IEEE Access 7 (2019), pp. 11651–11667.
    DOI
  • Seaman, S., J. Galati, D. Jackson, et al. What Is Meant by “Missing at Random”? In: Statistical Science 28.2 (2013), pp. 257–268.
  • Shao, J. and J. Zhang. A transformation approach in linear mixed-effects models with informative missing responses. In: Biometrika 102.1 (2015), pp. 107-119.
    DOI
  • Simon, G. A. and J. S. Simonoff. Diagnostic plots for missing data in least squares regression. In: Journal of the American Statistical Association 81.394 (1986), pp. 501-509.
    DOI
  • Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
    DOI
  • Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
    URL
  • Tchetgen Tchetgen, E. J., L. Wang, and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
    DOI
  • Templ, M., A. Alfons, and P. Filzmoser. Exploring Incomplete data using visualization techniques. In: Advances in Data Analysis and Classification 6.1 (2012), pp. 29-47.
    DOI
  • Thijs, H., G. Molenberghs, B. Michiels, et al. Strategies to fit pattern-mixture models. In: Biostatistics 3.2 (2002), pp. 245-265.
    DOI
  • Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
    DOI
  • Vansteelandt, S., A. Rotnitzky, and J. Robins. Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. In: Biometrika 94.4 (2007), pp. 841–860.
    DOI
  • Verbeke, G., G. Molenberghs, H. Thijs, et al. Sensitivity analysis for nonrandom dropout: a local influence approach. In: Biometrics 57.1 (2001), pp. 7-14.
    DOI
  • White, I. R., J. Carpenter, and N. J. Horton. A mean score method for sensitivity analysis to departures from the missing at random assumption in randomised trials. In: Statistica Sinica 28.4 (2018), pp. 1985–2003.
    DOI
  • Wu, M. C. and R. J. Carroll. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. In: Biometrics 44.1 (1988), pp. 175-188.
    DOI
  • Zhao, J. and Y. Ma. A versatile estimation procedure without estimating the nonignorable missingness mechanism. In: Journal of the American Statistical Association (2021), pp. 1–15.
    DOI
  • Zhou, Y., R. J. A. Little, and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.
    DOI
  • Gill, R. D., M. J. Van Der Laan, and J. M. Robins. Coarsening at random: Characterizations, conjectures, counter-examples. In: Proceedings of the First Seattle Symposium in Biostatistics. Springer. 1997, pp. 255–294.
    DOI
  • Mohan, K., F. Thoemmes, and J. Pearl. Estimation with Incomplete Data: The Linear Case. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, Jul. 2018, pp. 5082–5088.
  • Sportisse, A., C. Boyer, and J. Josse. Estimation with informative missing data in the low-rank model with random effects. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 1906.02493v3.
    URL
  • Mohan, K. and J. Pearl. Graphical Models for Processing Missing Data. Tech. rep. R-473-L. Forthcoming, Journal of American Statistical Association (JASA). CA: Department of Computer Science, University of California, Los Angeles, 2019.
    URL
  • Tierney, N. and D. Cook. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Monash Econometrics and Business Statistics Working Papers 14/18. Monash University, Department of Econometrics and Business Statistics, 2018.
    URL

