How to simulate missing values?

Teresa Alves de Sousa, Imke Mayer, Thomas Schmitt

Missing values occur in many domains and most datasets contain them (due to non-response, lost records, machine failures, dataset fusion, etc.). These missing values have to be handled before or during the analysis of such datasets.

Now, if you have a method that deals with missing values, for instance imputation or estimation with missing values, how can you assess the performance of your method on a given dataset? If the data already contains missing values, then this does not help you, since you generally do not have a ground truth for these missing values. So you will have to simulate missing values, i.e. you remove values -- which you therefore know to be the ground truth -- to generate missing values.

The mechanisms generating missing values can be various, but usually they are classified into three main categories defined by Rubin 1976: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). The first two are also qualified as ignorable missing values mechanisms, for instance in likelihood-based approaches to handling missing values, whereas the MNAR mechanism generates nonignorable missing values. In the following we will briefly introduce each mechanism (with the definitions widely used in the literature) and propose ways of simulating missing values under these three mechanism assumptions. For more precise definitions we refer to the references in the bibliography on the R-miss-tastic website.

Notations

Let's denote by $\mathbf{X}\in\mathcal{X}_1\times\dots\times\mathcal{X}_p$ the complete observations. We assume that $\mathbf{X}$ is a concatenation of $p$ columns $X_j\in\mathcal{X}_j$, $j\in\{1,\dots,p\}$, where $\dim(\mathcal{X}_j)=n$ for all $j$.

The data can be composed of quantitative and/or qualitative values, hence $\mathcal{X}_j$ can be $\mathbb{R}^n$, $\mathbb{Z}^n$ or, more generally, $\mathcal{S}^n$ for any discrete set $\mathcal{S}$.

Missing values are indicated as np.nan (not a number) and we define an indicator matrix $\mathbf{R}\in\{0,1\}^{n\times p}$ such that $R_{ij}=1$ if $X_{ij}$ is observed and $R_{ij}=0$ otherwise. We call this matrix $\mathbf{R}$ the response (or missingness) pattern of the observations $\mathbf{X}$. According to this pattern, we can partition the observations $\mathbf{X}$ into observed and missing: $\mathbf{X} = (\mathbf{X}^{obs}, \mathbf{X}^{mis})$.
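In code, the response pattern can be recovered directly from the nan entries; here is a minimal sketch (the small matrix X below is only an illustration):

import numpy as np

X = np.array([[1., np.nan],
              [2., 3.]])
R = (~np.isnan(X)).astype(int)  # R[i,j] = 1 iff X[i,j] is observed
X_obs_values = X[R == 1]        # the observed part of X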

We generate a small example of two-dimensional normal observations $\mathbf{X} \sim \mathcal{N}(\mu, I_2)$, $\mu \in \mathbb{R}^2$:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# Generate the complete data

np.random.seed(0)  # fix the seed 

n_samples = 900

# Generate normal data with Id_2 covariance.
mean = (0,0)
cov = [[1,0],[0,1]]
X_complete = np.random.multivariate_normal(mean, cov, size=n_samples)
# X_complete = np.random.uniform(0, 1, size=(n_samples,2))

# plot the data
sns.jointplot(x=X_complete[:, 0], y=X_complete[:, 1], label='Complete data');

Amputation

We want to simulate a matrix X_obs from X_complete with missing entries (replacing some entries by np.nan).
To do so, we sample the indicator matrix R (the missingness pattern) with the same shape as X_complete, indicating the non-missing entries:

$X_{obs} = X_{complete} \cdot \mathbb{1}_{\{R=1\}} + \mbox{np.nan} \cdot \mathbb{1}_{\{R=0\}}$
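Given R, this masking is a one-liner; a sketch (assuming R is a 0/1 array of the same shape as X_complete):

# Sketch: apply a response pattern R to the complete data
X_obs = np.where(R == 1, X_complete, np.nan)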

The mechanisms generating the indicator matrix (and thus the missing values) can be various, but usually they are classified into three main categories defined by Rubin 1976 (see the bibliography):

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing Not at Random (MNAR)

1. MCAR Mechanism (Missing Completely at Random)

The observations are said to be Missing Completely At Random (MCAR) if the probability that an observation is missing is independent of the variables and observations: the probability that an observation is missing does not depend on $(\mathbf{X}^{obs},\mathbf{X}^{mis})$. Formally this is: $$\mathbb{P}_R(R\,|\, X^{obs}, X^{mis}; \phi) = \mathbb{P}_R(R) \qquad \forall \, \phi.$$

In Python, the easiest way is to sample a Bernoulli mask:

In [3]:
import warnings

def ampute_mcar(X_complete, missing_rate=.2):
    # Mask some values completely at random
    
    M = np.random.binomial(1, missing_rate, size=X_complete.shape)
    X_obs = X_complete.copy()
    np.putmask(X_obs, M, np.nan)
    print('Percentage of newly generated missing values: {}'.\
      format(np.round(np.sum(np.isnan(X_obs))/X_obs.size, 3)))
    
    # warning if a full row is missing
    for row in X_obs:
        if np.all(np.isnan(row)):
            warnings.warn('Some row(s) contain only nan values.')
            break

    # warning if a full column is missing
    for col in X_obs.T:
        if np.all(np.isnan(col)):
            warnings.warn('Some column(s) contain only nan values.')
            break
            
    return X_obs
In [4]:
X_obs_mcar = ampute_mcar(X_complete)
Percentage of newly generated missing values: 0.192
UserWarning: Some row(s) contain only nan values.
In [5]:
print('X_obs_mcar = ')
print(X_obs_mcar[:10])
print('  ...')
X_obs_mcar = 
[[ 1.76405235         nan]
 [ 0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788]
 [        nan         nan]
 [-0.10321885  0.4105985 ]
 [ 0.14404357  1.45427351]
 [ 0.76103773  0.12167502]
 [ 0.44386323  0.33367433]
 [        nan -0.20515826]
 [        nan         nan]]
  ...
In [6]:
# plotting functions with seaborn
# scripts are at: https://github.com/R-miss-tastic/website/blob/master/static/how-to/python/utils_plot.py
from utils_plot import (hist_plot, scatter_plot_obs,
                        scatter_plot_with_missing_completed)
In [7]:
print('Data with "MCAR" mechanism.')
scatter_plot_obs(X_obs_mcar)
Data with "MCAR" mechanism.
In [8]:
print('Plot completed with X_complete information.')
scatter_plot_with_missing_completed(X_obs_mcar, X_complete)
Plot completed with X_complete information.
In [9]:
hist_plot(X_obs_mcar, X_complete)

From the plots and the histogram above, we can infer that the missingness is independent of both the observed and the missing values.
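A quick numerical sanity check of this independence (a sketch using the ground truth we kept):

# Sketch: under MCAR, the missing rate of X[:,1] should not depend on X[:,0]
mask_col1 = np.isnan(X_obs_mcar[:, 1])
print('missing rate of X[:,1] when X[:,0] > 0: ',
      mask_col1[X_complete[:, 0] > 0].mean())
print('missing rate of X[:,1] when X[:,0] <= 0:',
      mask_col1[X_complete[:, 0] <= 0].mean())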

2. MAR Mechanism (Missing at Random)

The observations are said to be Missing At Random (MAR) if the probability that an observation is missing only depends on the observed data $\mathbf{X}^{obs}$. Formally,

$$\mathbb{P}_R(R\,|\,X^{obs},X^{mis};\phi)=\mathbb{P}_R(R\,|\,X^{obs};\phi) \qquad \forall \,\phi,\, \forall \, X^{mis}.$$

Contrary to R, there is no Python implementation of amputation yet.
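Still, for our simple two-dimensional example a MAR mechanism can be sketched in a few lines: keep the first column fully observed and let the missingness of the second column depend on it through a logistic model (ampute_mar_simple and the slope value 3 are illustrative choices, not part of the original workflow):

from scipy.special import expit as sigmoid  # logistic function

def ampute_mar_simple(X_complete, slope=3):
    # Sketch: column 0 stays observed; column 1 is masked with
    # probability sigmoid(slope * X[:, 0]), which only depends on
    # the always-observed column 0 (hence MAR)
    X_obs = X_complete.copy()
    p_miss = sigmoid(slope * X_complete[:, 0])
    M = np.random.binomial(1, p_miss).astype(bool)
    X_obs[M, 1] = np.nan
    return X_obs

The work-in-progress implementation below is more general: it lets observed features censor other features through a weight matrix W.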

In [10]:
# The following code is work in progress
from sklearn.preprocessing import normalize

def ampute_mar(X_complete, missing_rate=.2, W=None):
    """ Observed values will censor the missing ones
    
    The probability of being missing: M_proba = X_obs.dot(W)
    So for each sample, some observed features (P=1) will influence 
    the missingness of some other features (P=0) w.r.t. the weight 
    matrix W (shape n_features x n_features).
    
    e.g. in a questionnaire, those who said they were busy (X_obs[:,0] = 1) 
    usually fail to fill in the last question (X_obs[:,-1] = np.nan).
    So here W[0,-1] = 1
    """
    X_obs = X_complete.copy()
    M_proba = np.zeros(X_obs.shape)
    
    if W is None:
        # generate the weight matrix W
        W = np.random.randn(X_complete.shape[1], X_complete.shape[1])

    # Several iterations to leave room for a high missing_rate
    for i in range(X_obs.shape[1]*2):
        # Sample a pattern matrix P
        # P[i,j] = 1 will correspond to an observed value
        # P[i,j] = 0 will correspond to a potential missing value
        P = np.random.binomial(1, .5, size=X_complete.shape)

        # potential missing entries do not take part in the missingness computation
        X_not_missing = np.multiply(X_complete, P)

        # sample from the proba X_obs.dot(W)
        # (scale is the standard deviation of the noise)
        sigma = np.std(X_not_missing)
        M_proba_ = np.random.normal(X_not_missing.dot(W), scale=sigma)

        # observed entries (P=1) should have M_proba_ = 0
        M_proba_ = np.multiply(M_proba_, 1-P)  # M_proba_[P==1] = 0
        
        M_proba += M_proba_

    threshold = np.percentile(M_proba.ravel(), 100 * (1 - missing_rate))
    M = M_proba > threshold

    np.putmask(X_obs, M, np.nan)
    print('Percentage of newly generated missing values: {}'.\
      format(np.sum(np.isnan(X_obs))/X_obs.size))
    return X_obs
In [11]:
W = np.array([[0,10],[0,0]]) 
# With this weight matrix W, 
# missingness of X[:,1] depends on X[:,0] values
X_obs_mar = ampute_mar(X_complete, W=W)
Percentage of newly generated missing values: 0.2
In [12]:
print('Data with "MAR" mechanism.')
scatter_plot_obs(X_obs_mar)
Data with "MAR" mechanism.

From the scatter plot we see that the samples with nan in the X[:,1] coordinate (y-axis), represented as orange stars at the bottom, are highly correlated with the X[:,0] values (x-axis).

X[i,0] > .5 => X[i,1] = np.nan with high probability.
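We can check this claim empirically (a sketch):

# Sketch: missing rate of X[:,1] conditional on the observed X[:,0]
mask_col1 = np.isnan(X_obs_mar[:, 1])
print('missing rate of X[:,1] when X[:,0] > .5: ',
      mask_col1[X_complete[:, 0] > .5].mean())
print('missing rate of X[:,1] when X[:,0] <= .5:',
      mask_col1[X_complete[:, 0] <= .5].mean())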

In [13]:
print('Samples completed with X_complete information.')
scatter_plot_with_missing_completed(X_obs_mar, X_complete)
Samples completed with X_complete information.
In [14]:
hist_plot(X_obs_mar, X_complete)

The histograms overlap: the missingness does not depend on the missing values.

3. MNAR (Missing Not At Random)

If the mechanism is neither MCAR nor MAR, the values are Missing Not At Random (MNAR).

Here we ampute high values with high probability; this mechanism might be called "censoring".

In [15]:
from scipy.special import expit as sigmoid  # logistic function

def ampute_mnar(X_complete, missing_rate=.2):
    """ ampute X_complete with censoring (Missing Not At Random)
    
    The missingness depends on the values.
    This tends to "censor" X[i,j] when X[i,j] is high 
    compared to its column X[:,j]
    """
    
    # M depends on X_complete values
    M_proba = np.random.normal(X_complete)
    M_proba = normalize(M_proba, norm='l1')
    
    # compute the threshold w.r.t. missing_rate
    threshold = np.percentile(M_proba.ravel(), 100 * (1 - missing_rate))
    M = M_proba > threshold
    
    X_obs = X_complete.copy()
    np.putmask(X_obs, M, np.nan)
    print('Percentage of newly generated missing values: {}'.\
      format(np.sum(np.isnan(X_obs))/X_obs.size))
    return X_obs
In [16]:
X_obs_mnar = ampute_mnar(X_complete)
Percentage of newly generated missing values: 0.2
In [17]:
scatter_plot_obs(X_obs_mnar)
print('X_obs with "MNAR" mechanism')
X_obs with "MNAR" mechanism
In [18]:
print('Samples completed with X_complete information.')
scatter_plot_with_missing_completed(X_obs_mnar, X_complete)
Samples completed with X_complete information.
In [19]:
hist_plot(X_obs_mnar, X_complete)

The distribution of the missing values is shifted to the right compared to the distribution of the observed values. The missingness pattern depends on the values of the missing entries.
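Since we amputed the data ourselves, this shift can be verified numerically (a sketch):

# Sketch: compare the ground-truth means of amputed and observed entries
M = np.isnan(X_obs_mnar)
print('mean of the amputed values: ', X_complete[M].mean())
print('mean of the observed values:', X_complete[~M].mean())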

Summary

  1. Missing completely at random (MCAR): the missingness mechanism can be ignored; the nan entries are distributed like white noise.
  2. Missing at random (MAR): the missingness mechanism depends on the observed values.
  3. Missing not at random (MNAR): the missingness might carry information about the missing values.

Additional remarks

See also the ampute function from the R package mice.

This workflow is also available in R.

We can also include the label y in the data, so that the missingness also depends on the target (we might call this "predictive missingness").
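A minimal sketch of such a mechanism (the linear model for y and the 80% threshold are arbitrary illustrative choices):

# Sketch: "predictive missingness" -- the mask depends on the target y
y = X_complete.dot(np.array([1., -1.])) + np.random.randn(n_samples)
X_obs_pred = X_complete.copy()
# mask the second feature for the samples with the largest targets
X_obs_pred[y > np.percentile(y, 80), 1] = np.nan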