On the consistency of supervised learning with missing values

Authors: Julie Josse (CMAP, Inria), Nicolas Prost (CMAP, Inria), Erwan Scornet (CMAP), Gaël Varoquaux (Inria).

This notebook is a tutorial based on the experiments of this paper. Notebook by Nicolas Prost and Thomas Schmitt.

Regression problems

Data generation ($X_{complete}, y$)

We consider three canonical regression problems where $X_{complete}$ is Gaussian and $y$ is either:

  • the first coordinate of $X$ with noise ($y = X_0 + \varepsilon$)
  • a linear function of $X$ with noise ($y = \beta X + \varepsilon$)
  • a quadratic function of the first coordinate with noise ($y = X_0^2 + \varepsilon$)
In [7]:
import numpy as np
from sklearn.datasets import make_friedman1

def generate_without_missing_values(data='simple', n_samples=200,
                                    n_features=2, random_state=0):
    """generate canonical regression data"""

    assert data in ['simple', 'linear', 'quadratic', 'friedman']
    np.random.seed(random_state)
    mean = np.ones(n_features)
    ro = .5
    cov = ro * np.ones((n_features, n_features)) +\
          (1 - ro) * np.eye(n_features)
    X = np.random.multivariate_normal(mean, cov, size=n_samples)
    epsilon = 0.1 * np.random.randn(n_samples)

    if data == 'simple':
        y = X[:, 0] + epsilon
    if data == 'linear':
        beta = [1, 2] + list(np.random.randn(n_features-2))
        y = X.dot(beta) + epsilon
    if data == 'quadratic':
        y = X[:, 0] * X[:, 0] + epsilon
    if data == 'friedman':  # X is no longer Gaussian here
        X, y = make_friedman1(n_samples=n_samples,
                              n_features=max(5, n_features),
                              noise=0.1, random_state=random_state)
    return X, y


def generate_with_missing_values(data='simple', n_samples=200,
                                 n_features=2, random_state=0,
                                 missing_mechanism='mcar',
                                 missing_rate=0.1):
    """Generate regression data with missing values"""

    X, y = generate_without_missing_values(data, n_samples,
                                           n_features, random_state)

    if missing_mechanism == 'mcar':
        missing_mask = np.random.binomial(1, missing_rate,
                                          (n_samples, n_features))
    elif missing_mechanism == 'censored':
        # p = .7  # percentile
        # B = np.random.binomial(1, missing_rate,
        #                        (n_samples, n_features))
        missing_mask = (X > np.percentile(X, 100*(1-missing_rate)))  # * B
    elif missing_mechanism == 'predictive':
        missing_mask = np.random.binomial(1, missing_rate,
                                          (n_samples, n_features))
        y += 2 * missing_mask[:, 0]

    X_complete = X.copy()
    np.putmask(X, missing_mask, np.nan)

    return X, y, X_complete, missing_mask
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10,10))
for id_plot, data in enumerate(['simple','linear','quadratic']):
    X, y = generate_without_missing_values(data=data)
    plt.subplot(2,2,id_plot+1)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.xlabel('$X_0$')
    plt.ylabel('$X_1$')
    clb = plt.colorbar()
    clb.ax.set_title("y")
    if data == 'simple':
        plt.title('$y = X_0 + ε$')
    if data == 'linear':
        plt.title('$y = β X + ε$')
    if data == 'quadratic':
        plt.title('$y = X_0² + ε$')

Data generation with missingness ($X, y$)

We consider three missing data mechanisms $M$ such that $X = X_{complete} \cdot \mathbb{1}_{[M=0]} + \mbox{'NA'} \cdot \mathbb{1}_{[M=1]}$

  1. MCAR: values are missing completely at random, $M \sim \mathcal{B}(p)$
  2. Censored MNAR: values are missing only for high values of each variable, $M \sim \mathbb{1}_{[X>\alpha]}$
  3. Predictive missingness: $y$ depends on the missingness mechanism, $y = f(X) + M$
In [16]:
plt.figure(figsize=(15,10))
for id_mechanism, mechanism in enumerate(['mcar','censored', 'predictive']):
    X, y, X_complete, M = generate_with_missing_values(missing_mechanism=mechanism)
    row_with_missing = [any(np.isnan(x)) for x in X]
    plt.subplot(2,2,id_mechanism+1)
    plt.scatter(X[:, 0], X[:, 1], c=y, label='non-missing', alpha = 1)
    if mechanism == 'predictive':
        plt.scatter(X_complete[row_with_missing, 0], X_complete[row_with_missing, 1],
                    marker='*', c=y[row_with_missing], s = 60, label='missing')
    else:
        plt.scatter(X_complete[row_with_missing, 0], X_complete[row_with_missing, 1],
                    marker='*', c='r', s = 60, label='missing')
    plt.legend()
    plt.xlabel("X0")
    plt.ylabel("X1")
    if id_mechanism == 2:
        clb = plt.colorbar()
        clb.ax.set_title("y")
    plt.title(mechanism)

Imputers and estimators

Imputation strategies

We consider three imputation strategies:

  1. Mean imputation: replace 'NA' by the mean of the feature (column)
  2. Iterative imputation: each feature is regressed from the others
  3. MIA (Missing Incorporated in Attribute, Twala et al.): each feature is duplicated and its missing values are replaced once by $+\infty$ and once by $-\infty$. Combined with boosted trees, this imputation implicitly corresponds to how lightGBM splits the data (a minimal sketch is given below).

For mean and iterative imputation, we also consider whether or not to add the missing-values indicator (by concatenating the indicator, or "mask", to $X$).
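For concreteness, here is a minimal, hypothetical sketch of the MIA transformation (the experiments below use the MIAImputer class from utils instead); large finite sentinels stand in for $\pm\infty$ so that estimators which reject infinite inputs still accept the result.

In [ ]:
# Hypothetical minimal sketch of MIA (illustration only, not utils.MIAImputer):
# each feature is duplicated, and its missing entries are pushed once to a very
# large value and once to a very small one, so that a tree can route missing
# samples to either side of any split.
def mia_transform(X, big=1e10):
    X_high = np.where(np.isnan(X), big, X)   # copy with NA -> +big
    X_low = np.where(np.isnan(X), -big, X)   # copy with NA -> -big
    return np.hstack([X_high, X_low])

X_toy = np.array([[1.0, np.nan],
                  [np.nan, 3.0]])
print(mia_transform(X_toy))
# [[ 1.e+00  1.e+10  1.e+00 -1.e+10]
#  [ 1.e+10  3.e+00 -1.e+10  3.e+00]]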

In [10]:
# examples of imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

plt.figure(figsize=(10,5))
for id_mechanism, mechanism in enumerate(['mcar','censored']):
    X, y, X_complete, M = generate_with_missing_values(missing_mechanism=mechanism,
                                                       n_features=10, n_samples=100)
    
    row_with_missing = [any(np.isnan(x)) for x in X]
    X_hat_mean = SimpleImputer(strategy='mean').fit_transform(X)
    X_hat_iterative = IterativeImputer().fit_transform(X)
    plt.subplot(1,2,id_mechanism+1)
    plt.scatter(X_complete[row_with_missing, 0], X_complete[row_with_missing, 1],
                marker='*', c='r',
                s = 70, label='data with missing coordinate', alpha=1)
    plt.scatter(X_hat_mean[row_with_missing, 0], X_hat_mean[row_with_missing, 1],
                marker='d', c='blue',
                label='mean-imputed', alpha = .7)
    plt.scatter(X_hat_iterative[row_with_missing, 0], X_hat_iterative[row_with_missing, 1],
                marker='P', c='black',
                label='iterative-imputed', alpha = .7)
    plt.legend()
    plt.xlabel("X0")
    plt.ylabel("X1")
    plt.title('missing mechanism: {}'.format(mechanism))
    

Estimator

As for the estimator, which models $y$ from $\hat{X}$ (the imputed $X$), we use boosted decision trees with their default parameters.

Note: the imputation strategy (e.g. the column means or the regression models) is learned on the training data, and the same imputation is then applied to the test data. This may allow the trees to 'catch' that the mean value can stand for a missing value.
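As an illustration of this point (a toy setup assumed here, not one of the paper's experiments), the imputer is fitted on a training split only and its learned statistics are reused on the test split:

In [ ]:
# Toy illustration (assumed setup): the column means are learned on the training
# split only, then the very same means are used to impute the test split.
from sklearn.model_selection import train_test_split

X, y, _, _ = generate_with_missing_values(missing_mechanism='mcar',
                                          n_features=10, n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

imputer = SimpleImputer(strategy='mean').fit(X_train)  # means learned on train
X_test_imputed = imputer.transform(X_test)             # same means applied to test
print(imputer.statistics_[:3])                         # per-column training means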

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

from utils import MIAImputer  # imputes with +inf and -inf
# The code can be found at https://github.com/TwsThomas/supervised_missing/blob/master/Notebook/utils.py

boosting_with_mean_imputation = Pipeline([
    ('Mean imputation', SimpleImputer(strategy='mean')),
    ('Boosting', HistGradientBoostingRegressor(n_iter_no_change=None))
])   
boosting_with_iterative_imputation = Pipeline([
    ('Iterative imputation', IterativeImputer()), 
    ('Boosting', HistGradientBoostingRegressor(n_iter_no_change=None))
])
boosting_with_mean_imputation_and_mask = Pipeline([
    ('Mean imputation and mask', SimpleImputer(strategy='mean', add_indicator=True)), 
    ('Boosting', HistGradientBoostingRegressor(n_iter_no_change=None))
])   
boosting_with_iterative_imputation_and_mask = Pipeline([
    ('Iterative imputation and mask', IterativeImputer(add_indicator=True)), 
    ('Boosting', HistGradientBoostingRegressor(n_iter_no_change=None))
])
boosting_with_mia = Pipeline([
    ('MIA imputation', MIAImputer()),
    ('Boosting', HistGradientBoostingRegressor(n_iter_no_change=None))
])
boostings = {
    'Mean': boosting_with_mean_imputation,
    'Iterative': boosting_with_iterative_imputation,
    'Mean+mask': boosting_with_mean_imputation_and_mask,
    'Iterative+mask': boosting_with_iterative_imputation_and_mask,
    'MIA': boosting_with_mia,
}

Model evaluation

In order to evaluate our estimators (boosted trees combined with each imputation strategy), we plot:

  • the learning curve: the evolution of the (cross-validated) prediction score ($R^2$ score) as a function of the training size.
  • a boxplot, easier to read, that extracts one slice of the learning curve
In [12]:
from sklearn.model_selection import learning_curve
from sklearn.model_selection import cross_val_score, ShuffleSplit


def plot_learning_curve(estimators, missing_mechanism, data='simple'):
    """ Plot a learning curve for a dictionary of estimators """
    X, y, _, _ = generate_with_missing_values(data=data, n_samples=2 * 10 ** 5,
                                        n_features=10, missing_mechanism=missing_mechanism)
    sizes = np.logspace(2, 4, 7).astype(int)
    
    for key, est in estimators.items():
        train_sizes, _, valid_scores = learning_curve(
            est, X, y, train_sizes=sizes, scoring='r2', cv=8, random_state=0,
            n_jobs=-1)
        
        plt.plot(train_sizes, np.median(valid_scores, axis=1), label=key)
        plt.fill_between(train_sizes,
            np.percentile(valid_scores, q=25, axis=1),
            np.percentile(valid_scores, q=75, axis=1), alpha=0.2)
    
    # label axis
    plt.xscale('log')
    plt.legend(loc='best')
    plt.xlabel('Sample size')
    plt.ylabel('r2 score')
    plt.title(missing_mechanism)
    plt.xlim(train_sizes.min(), train_sizes.max())
    

def boxplot_scores(estimators, n_samples, missing_mechanism, data='simple', plot_id=2):
    """ Plot scores for a dictionary of estimators """
    X, y, _, _ = generate_with_missing_values(data=data, n_samples=2 * 10 ** 5,
                                        n_features=10, missing_mechanism=missing_mechanism)
    
    scores = {}
    for key, est in estimators.items():
        cv = ShuffleSplit(n_splits=8, train_size=n_samples, random_state=0)
        scores[key] = cross_val_score(
            est, X, y, scoring='r2', cv=cv, n_jobs=-1)

    if plot_id<=3:
        plt.title('Missing: {}'.format(missing_mechanism))
    if plot_id%3==1:
        plt.ylabel('data: {}'.format(data))
        labels = None
    else:
        labels = list(scores.keys())[::-1]
        
    bplot = plt.boxplot(list(scores.values())[::-1], labels=labels,
                        vert=False, patch_artist=True)
    # Add colors to boxplots
    for patch, color in zip(bplot['boxes'],
                            ['C0', 'C1', 'C2', 'C3','C4'][::-1]):
        patch.set_facecolor(color)

Results

When data are missing completely at random, iterative imputation (with or without the mask) has better finite-sample properties than mean imputation: it converges faster. However, for censored data, iterative imputation performs poorly without the help of the indicator, while mean imputation seems self-sufficient.

In [13]:
plot_learning_curve(boostings, 'mcar')
plt.title("Learning curve of boosted trees for mcar data");

Iterative imputation and MIA converge faster.

In [14]:
boxplot_scores(boostings, n_samples=1000, missing_mechanism='mcar')
plt.title("Boxplot of boosted tree for mcar data");

When values are missing completely at random (MCAR), the mask (missing-values indicator) does help the mean imputer. Iterative imputation beats mean imputation thanks to the quality of its regression-based imputation.

In [10]:
# boxplot for differents data (rows) and different missing mechanism (columns)
n_samples = 1000
plt.figure(figsize=(15,10))
for id_mm, missing_mechanism in enumerate(['mcar','censored','predictive']):
    for id_data, data in enumerate(['simple','linear','quadratic']):
        plot_id = id_mm + 3*id_data + 1
        plt.subplot(3,3,plot_id)
        boxplot_scores(boostings, n_samples, missing_mechanism, data, plot_id)

The graphs above show the boxplots of scores for boosted decision trees on three kinds of data, 'simple', 'linear' and 'quadratic' (row-wise), and with three kinds of missing mechanisms, 'MCAR', 'censored' and 'predictive' (column-wise).

  • Under MCAR, iterative imputation does help.
  • When $y$ depends on the missingness mechanism, adding the indicator helps.

Conclusion

  • To train and test on data with missing values, the same imputation model should be used.

  • Missing Incorporated in Attribute (MIA) is a good solution for tree-based models (e.g. the lightGBM implementation): it exploits the missingness information directly (a related sketch using scikit-learn's native NaN handling is given after this list).

  • Imputation methods reduce the number of samples required to reach good prediction performance.

  • When missingness is related to the prediction target, it is useful to add indicator variables of missing entries as features
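As a complement to the MIA point above, a hedged sketch (assuming scikit-learn >= 0.22): HistGradientBoostingRegressor can also consume NaN directly, routing samples with missing values to the split side that improves the gain, which is close in spirit to MIA and to lightGBM's default handling.

In [ ]:
# Hedged sketch (assumes scikit-learn >= 0.22, where HistGradientBoostingRegressor
# natively accepts NaN): score the boosting model on raw data with missing values,
# without any explicit imputation step.
from sklearn.model_selection import cross_val_score, ShuffleSplit

X, y, _, _ = generate_with_missing_values(missing_mechanism='predictive',
                                          n_features=10, n_samples=20000)
cv = ShuffleSplit(n_splits=8, train_size=1000, random_state=0)
scores = cross_val_score(HistGradientBoostingRegressor(n_iter_no_change=None),
                         X, y, scoring='r2', cv=cv, n_jobs=-1)
print(scores.mean())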

Going further

Using the Friedman data (data='friedman')

In [14]:
n_samples = 1000
plt.figure(figsize=(15,5))
for id_mm, missing_mechanism in enumerate(['mcar','censored','predictive']):
    for id_data, data in enumerate(['friedman']):
        plot_id = id_mm + 3*id_data + 1
        plt.subplot(1,3,plot_id)
        plot_learning_curve(boostings, missing_mechanism, data=data)
In [15]:
# boxplot for data==friedman and different missing mechanism
n_samples = 1000
plt.figure(figsize=(15,5))
for id_mm, missing_mechanism in enumerate(['mcar','censored','predictive']):
    for id_data, data in enumerate(['friedman']):
        plot_id = id_mm + 3*id_data + 1
        plt.subplot(1,3,plot_id)
        boxplot_scores(boostings, n_samples, missing_mechanism, data, plot_id)