
A resource website on missing values - Methods and references for managing missing data

On this platform we attempt to give you an overview of the main references on missing values. We do not claim to gather all available references on the subject but rather to offer a peek into different fields of active research on handling missing values, allowing for an introductory reading as well as a starting point for further bibliographical research.

Inspired by the CRAN Task View on Missing Data and a review by Imbert & Vialaneix on handling missing values (2018, written in French), we organized our selection of relevant references on missing values by topic.

In order to provide a more formal introduction to the problem of missing values and the existing methods to handle them (e.g. diagnosing or describing the missingness, or performing statistical analysis on the incomplete data), we introduce some fairly standard definitions and notation used in the remainder of this article.

  • Let \(X=(X_1,\dots, X_p)\) be a vector of \(p\) random variables which can be continuous or categorical.

  • We denote by \(x_{ij}\) the observation of variable \(X_j\) for an individual \(i\in\{1,\dots,n\}\) and by \(\mathbf{x}_i=(x_{i1},\dots,x_{ip})\) the vector of observations of all \(p\) variables for the individual \(i\).

  • The observations of the \(n\) individuals are stacked by rows in a matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\).

  • The indicator matrix of missing values \(\mathbf{R}\) is defined such that its values \((r_{ij})_{\substack{i=1,\dots,n\\j=1,\dots,p}}\) are given by: \(r_{ij} = \left\{\begin{array}{ll}1 & \text{ if } x_{ij} \text{ is observed}\\0 & \text{ otherwise}\end{array}\right. = \mathbb{1}_{\{x_{ij} \text{ is observed}\}}\). The associated random variable is denoted by \(R\).

  • The observed and missing parts of \(X\) are denoted respectively by \(X_{obs}\) and \(X_{mis}\).
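The notation above can be illustrated with a minimal sketch. It assumes that a missing entry is encoded as `None`; the toy matrix and the helper name `indicator_matrix` are made up for this example.

```python
# Toy data matrix with n = 3 individuals and p = 3 variables
# (continuous and categorical). Assumption: None encodes a missing entry.
X = [
    [1.2, "a", None],
    [None, "b", 3.4],
    [0.7, None, 5.1],
]

def indicator_matrix(X):
    """Return R with r_ij = 1 if x_ij is observed and 0 otherwise."""
    return [[0 if value is None else 1 for value in row] for row in X]

R = indicator_matrix(X)

# Observed part of each individual, keyed by the variable index j.
X_obs = [{j: x for j, x in enumerate(row) if x is not None} for row in X]

# Proportion of observed entries for each variable X_j.
n = len(X)
observed_rate = [sum(row[j] for row in R) / n for j in range(len(X[0]))]

print(R)
print(observed_rate)
```

Summaries of \(\mathbf{R}\) such as the per-variable observation rate are a common first step when describing the missingness in a dataset.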

These general references and reviews are helpful to get started with the large field of missing values as they provide an introduction to the main concepts and methods or give an overview of the diversity of topics in statistical analysis related to missing values. They discuss different mechanisms that generated the missing values, necessary conditions for working consistently on the observed values alone and ways to impute, i.e. complete, the missing values to end up with complete datasets allowing the use of standard statistical analysis methods.

If you are rather new to the subject and wish to start with less formal and more application-based introductions, or if you are looking for general high-level advice on handling missing data, we suggest the following publications:

Furthermore you can have a look at the following statistical journals which regularly contain recent results related to handling missing data:

The first intuitive and probably most widely applied solution for dealing with missing values in data analysis is to delete the partial observations and to work exclusively on the individuals with complete information. This has several drawbacks; among others, it introduces an estimation bias in most cases (more precisely, in cases where the missingness is not independent of the data). In order to reduce this bias, one can reweight the complete observations to compensate for the deletion of incomplete individuals in the dataset. The weights are defined by inverse probabilities, for instance the inverse of the probability for each individual of being fully observed. This method is known as inverse probability weighting and is described in detail in the publications below. We split the references into two parts: handling missing values in survey data and performing causal inference in the presence of missing values, both of which rely on weighting methods.
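As a rough illustration of the difference between deleting incomplete individuals and reweighting the complete ones, consider the sketch below. It assumes that the probability of being observed is known for each individual, whereas in practice it usually has to be estimated; the data and the function names are invented for the example.

```python
def complete_case_mean(y, r):
    """Mean of y over the individuals that are observed (r = 1)."""
    observed = [yi for yi, ri in zip(y, r) if ri == 1]
    return sum(observed) / len(observed)

def ipw_mean(y, r, prob_observed):
    """Weighted mean of the observed y, with weights 1 / P(observed).

    The weights are normalized by their sum (a self-normalized estimator).
    """
    pairs = [(yi, 1.0 / p) for yi, ri, p in zip(y, r, prob_observed) if ri == 1]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

# Hypothetical population: 4 individuals of group A (y = 10) and
# 4 individuals of group B (y = 20), so the full-data mean is 15.
y = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
# Group A always responds; group B responds with probability 0.5.
prob_observed = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
# One possible realisation: two individuals of group B are missing.
r = [1, 1, 1, 1, 1, 0, 1, 0]

print(complete_case_mean(y, r))
print(ipw_mean(y, r, prob_observed))
```

In this toy setting the complete-case mean is pulled towards group A, which is over-represented among the respondents, while the weighted mean gives each observed member of group B a weight of 2 and recovers the full-data mean.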

For survey data analysis

Such weighting methods are widely used on survey data in order to correct for unbalanced sampling fractions by balancing the empirical distributions of the observed covariates to recover the structure of the target population.

Methods in common with causal inference

Inverse probability weighting is also considered in causal inference: a bias is induced by the presence of confounders, i.e. variables which interact with both covariates and outcome. Hence, if the goal is to estimate causal relationships between covariates and outcome, it is necessary to account for the potential effect of confounders (a selection bias) on the result of causal inference.

Let \(x_i\) be an observation with missing values, e.g. each entry of \(x_i\) could be the temperature on a certain day at a given place, and unfortunately for some days the temperature was not measured. An intuitive idea to replace this missing information could be: take other observations \(\{x_j\}_j\) which are similar to \(x_i\) on the observed values and use this information to fill in the gaps. This idea of taking observed values from neighbours or donors based on some similarity measure is implemented in the so-called hot-deck and k-nearest-neighbors (kNN) approaches.
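The donor idea can be sketched in a few lines. The version below is only one possible variant: it assumes numeric data with `None` for missing entries, measures similarity by the root mean squared difference on the coordinates observed in both rows, and fills a gap with the average of the k nearest donors. The temperatures and function names are made up.

```python
import math

def distance_on_shared(a, b):
    """Root mean squared difference over coordinates observed in both rows."""
    shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

def knn_impute(X, k=2):
    """Fill each missing entry with the mean of the k nearest donors' values."""
    completed = [list(row) for row in X]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows for which coordinate j is observed.
            donors = [other for m, other in enumerate(X)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda other: distance_on_shared(row, other))
            nearest = donors[:k]
            if nearest:
                completed[i][j] = sum(d[j] for d in nearest) / len(nearest)
    return completed

# Rows are places, columns are days; the first place lacks day 3.
temperatures = [
    [10.0, 12.0, None],
    [10.5, 12.5, 14.0],
    [9.5, 11.5, 13.0],
    [25.0, 27.0, 30.0],
]

print(knn_impute(temperatures, k=2)[0])
```

With `k=1` the gap is filled with the value of the single closest donor, which is close in spirit to a nearest-neighbour hot-deck; larger values of k average over several donors.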

The most popular way to deal with missing values in statistical inference tasks is to use likelihood-based approaches that can handle incomplete data. More precisely, if the missingness mechanism is ignorable (in a sense that is explained in the Missing values mechanisms section), then one can attempt to infer the model parameters by maximizing the likelihood on the observed values. When the mechanism cannot be ignored, a specific model for it needs to be assumed. The main algorithm available for performing maximum likelihood (ML) estimation with missing values is the Expectation Maximization (EM) algorithm. This algorithm requires knowledge of the joint distribution of \(X = (X_{obs}, X_{mis})\), and its implementation is not straightforward since it involves integrals which cannot always be computed easily. Once the model parameters are estimated, one can impute the missing values using this estimated information on the data model.
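To make the role of these integrals explicit, the observed-data likelihood and the two EM steps can be written as follows for an ignorable mechanism. The density \(f\), the parameter \(\theta\), the function \(Q\) and the iteration index \(t\) are notation assumed here for illustration only.

\[L(\theta; X_{obs}) = \int f(X_{obs}, X_{mis};\theta)\, dX_{mis}\]

\[\text{E-step:}\quad Q(\theta\mid\theta^{(t)}) = \mathbb{E}\left[\log f(X_{obs}, X_{mis};\theta)\,\middle|\, X_{obs};\theta^{(t)}\right]\]

\[\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta\mid\theta^{(t)})\]

The expectation in the E-step is taken over \(X_{mis}\) given \(X_{obs}\) under the current parameter value \(\theta^{(t)}\), which is where the joint distribution is needed; for categorical variables the integral becomes a sum.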

There also exist other methods that allow for statistical inference with missing values without relying on likelihood maximization.

There is a vast literature on how to perform (linear) regression, possibly in a high-dimensional setting, in the presence of missing values in the covariates. This can be seen as a particular case of supervised learning, which is presented below, even if the focus is often more on estimating parameters or selecting relevant variables.

In the previously mentioned EM algorithm there is in fact an implicit step called imputation: imputing a missing value means replacing it with a plausible one. The definition of plausibility is not stated explicitly but can be deduced from the method used to fill in the gaps. For instance, one could choose to replace all missing values of a certain variable \(X_j\) by the average observed value \(\frac{1}{n_{obs,j}}\sum_{i} x_{ij}\mathbb{1}_{\{x_{ij} \text{ is observed}\}}\), where \(n_{obs,j} = \sum_{i} \mathbb{1}_{\{x_{ij} \text{ is observed}\}}\). The interest of imputation is manifold: (1) it allows one to use all information in the sample (instead of deleting incomplete observations, which reduces the power of the statistical analysis), (2) if there is sufficient data, i.e. sufficiently many observations, then the imputation can be very accurate, which supports the quality of subsequent statistical analyses, and (3) the imputed dataset is a complete dataset to which one can apply standard statistical inference methods. The last point, however, has to be treated with caution since it implies that the statistical analysis no longer makes any distinction between observed values and imputed values. We will come back to this issue in the next section on multiple imputation.
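The mean imputation formula given above translates directly into code. The following is a minimal sketch for numeric data, assuming that `None` encodes a missing entry; the function name and the toy matrix are illustrative.

```python
def mean_impute(X):
    """Replace each missing x_ij by the average observed value of variable j."""
    n, p = len(X), len(X[0])
    completed = [list(row) for row in X]
    for j in range(p):
        observed = [X[i][j] for i in range(n) if X[i][j] is not None]
        if not observed:
            continue  # variable j is never observed: nothing to average
        mean_j = sum(observed) / len(observed)  # division by n_obs,j
        for i in range(n):
            if completed[i][j] is None:
                completed[i][j] = mean_j
    return completed

X = [
    [1.0, 10.0],
    [3.0, None],
    [None, 30.0],
    [5.0, 20.0],
]

print(mean_impute(X))
```

Every missing entry of a variable receives the same value, so the completed variable is less spread out than the observed values alone.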

Matrix factorization

A special case of imputation is matrix completion, which exploits structural assumptions about the row and column spaces to impute the missing values.

A major drawback of single imputation, i.e. where every missing value is replaced by a single most plausible value, is the underestimation of the overall variance of the data and of the inferred parameters. Indeed, by replacing every missing value by a given plausible one and by applying generic statistical methods to the completed dataset, one no longer distinguishes between initially observed and unobserved data. Therefore the variability due to the uncertainty about the missing values is not reflected in subsequent statistical analyses, which treat the dataset as if it had been fully observed from the beginning. A conceptually simple workaround for this problem is multiple imputation: instead of generating a single complete dataset with a given imputation method, one imputes every missing value by several possible values. The statistical analysis is then applied to each of the imputed datasets, and the resulting estimates are aggregated and used to estimate both the sampling variance and the variance due to the uncertainty in the missing values.
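The aggregation step is commonly carried out with the combining rules known as Rubin's rules. The sketch below assumes that each of the imputed datasets has already been analysed and has produced a point estimate together with its estimated variance; the three pairs of numbers are hypothetical and only serve to show the arithmetic.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool the results of m analyses run on m imputed datasets.

    Returns the pooled estimate, the within-imputation variance, the
    between-imputation variance and the total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, within, between, total

# Hypothetical results from m = 3 imputed datasets.
estimates = [2.0, 2.5, 3.0]
variances = [0.20, 0.25, 0.15]

pooled, within, between, total = pool_rubin(estimates, variances)
print(pooled, within, between, total)
```

The between-imputation term is the part that single imputation ignores: it is zero when all imputed datasets give the same estimate and grows with the uncertainty about the missing values.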

Since the field of machine learning depends on the availability of (good) training data, it inevitably faces the issue of missing data in most real-world applications. Hence there has been increasing attention to how to handle missing data, in the features and in the output, in order to learn accurately from the data.

Supervised learning

Methods for supervised learning (predicting an outcome as well as possible) with missing values in the covariates are quite different from methods for inference with missing values (estimating parameters).

Unsupervised learning

Methods have been suggested to perform clustering with missing values (k-means, mixture models) as well as dimensionality reduction with missing values (PCA).

Trees and forests

Decision trees are models based on the recursive application of elementary rules. This architecture grants them a variety of simple options for dealing with missing values without requiring prior imputation. A popular class of decision tree models is called random trees (or more generally random forests) and allows data analyses such as causal inference in the presence of missing values without having to impute these missing values.

Deep Learning

The advance and success of (deep) neural networks in many research and application areas such as computer vision and natural language processing have also brought renewed attention to the problem of handling missing values. Indeed, the question of training neural networks on incomplete data was considered even before the latest rise of deep learning, and it is considered essential due to the impact of missingness on the feasibility and quality of various learning problems.

As mentioned in the above sections, it is necessary to make assumptions on the mechanism generating the missing values or response mechanism in order to work with missing values. Broadly speaking, these assumptions indicate how much the missingness is related to the data itself. The assumptions made on the mechanism impact further steps in the data analysis (since some types of missingness can induce a bias on the analysis results) and are therefore crucial for valid analyses of data in the presence of missing values.

More formally, both \(X\) and \(R\) are modeled as random variables and the response mechanism is defined as the conditional distribution of \(R\) given \(X\), \(\mathbb{P}_R(R|X)\). This distribution can depend on some parameter \(\phi\) so that we have \(\mathbb{P}_R(R|X;\phi)\). Little and Rubin (2002) defined three main categories of missing values depending on the form of the conditional distribution \(\mathbb{P}_R\):

  • Missing completely at random (MCAR): The missingness does not depend on the variables \(X=(X_{obs},X_{mis})\), where \(X_{obs}\) and \(X_{mis}\) denote the observed and the missing variables respectively, i.e.

    \[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R;\phi), \quad \forall \phi\]

  • Missing at random (MAR): The missingness depends only on the observed variables \(X_{obs}\), i.e.

    \[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R|X_{obs};\phi), \quad \forall \phi, X_{mis}\]

  • Missing not at random (MNAR): The missingness is said to be MNAR in all other cases, i.e. when the missingness depends on the missing values and potentially also on the observed values.

To understand this definition, take the example of alcohol consumption: alcoholics are less inclined to reveal their alcohol consumption, therefore the probability of missing information on the alcohol consumption depends on the amount of consumption itself. Another simple example is the information on income or wealth which is missing more often for individuals of very high or very low income.
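The three mechanisms can also be compared on simulated data. The sketch below is a toy setting built on several assumptions: two correlated Gaussian variables, the first always observed and the second possibly missing, and logistic response mechanisms whose coefficients are arbitrary. All function names are illustrative.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Probability that x2 is missing, given the pair (x1, x2).
# x1 is always observed; only x2 can be missing.
def p_missing_mcar(x1, x2):
    return 0.3                     # constant: ignores the data entirely

def p_missing_mar(x1, x2):
    return logistic(2.0 * x1)      # depends only on the observed x1

def p_missing_mnar(x1, x2):
    return logistic(2.0 * x2)      # depends on the possibly missing x2 itself

def simulate(n, p_missing, rng):
    """Draw n pairs (x1, x2) and the indicator r for x2 (1 = observed)."""
    data = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.8 * x1 + 0.6 * rng.gauss(0.0, 1.0)  # x2 is correlated with x1
        r = 0 if rng.random() < p_missing(x1, x2) else 1
        data.append((x1, x2, r))
    return data

def observed_mean_x2(data):
    """Mean of x2 over the individuals for which x2 is observed."""
    values = [x2 for _, x2, r in data if r == 1]
    return sum(values) / len(values)

rng = random.Random(0)
for name, mechanism in [("MCAR", p_missing_mcar),
                        ("MAR", p_missing_mar),
                        ("MNAR", p_missing_mnar)]:
    sample = simulate(5000, mechanism, rng)
    print(name, round(observed_mean_x2(sample), 3))
```

In this setting the full-data mean of the second variable is zero. The mean of its observed values typically stays close to zero under MCAR and is shifted downwards under MAR and MNAR, because large values are more often missing. Under MAR the shift can be corrected using the fully observed first variable, whereas under MNAR this is not possible without further assumptions on the mechanism.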

Note that MCAR is a special case of MAR and that these three categories are of increasing complexity, with a large gap between the second and the third. Indeed, most of the generic methods proposed in the last few decades are suited to MAR data. The MNAR case requires different techniques and further assumptions.

Note that Little and Rubin (2002) consider these three categories as "really missing" values, as opposed to "not really missing" values where, in the case of categorical data, the missingness rather constitutes an additional category (for instance, in a questionnaire with multiple choice answers, a participant can leave out a question because the category they want to choose is not among the given choices).

Another, possibly complementary, approach to studying different missing values mechanisms and problems consists in using graphical models, for instance missingness graphs or m-graphs (Mohan et al., 2013). These make it possible to represent multivariate dependencies and to study identifiability or recoverability for different (estimation or prediction) problems.

Finally, another line of research considers the occurrence of missing values beforehand and addresses the question of how to anticipate or control the occurrence of missing values in a study design.

