How to predict with missing values in Python?

Aude Sportisse

Missing values occur in many applications of supervised learning. Methods for dealing with missing values in supervised learning are very different from methods for dealing with missing values in an inferential framework. For instance, mean imputation, which is the worst thing that can be done when the aim is to estimate parameters, can be consistent when the aim is to predict as well as possible, as shown in this recent paper.

In this notebook, we will cover how to accurately predict the response y given X when X contains missing values (missing values in both training and testing data).

In this case, there are essentially two approaches:

  1. Two-step strategy: imputing missing data and applying classical methods on the completed data sets to predict;
  2. One-step strategy: predicting with methods adapted to missing data, without necessarily imputing them.

We first describe the methods on synthetic data and then apply them to real datasets.

Description of the different strategies

We generate the covariates $X$ with 3 variables from a Gaussian distribution with a positive correlation structure, the Gaussian noise $\epsilon$ from the standard Gaussian distribution, and the fixed regression parameter $\beta$ from a uniform distribution. The outcome variable is obtained with the following linear model: $$Y=X\beta+\epsilon.$$
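The data-generating process described above can be sketched as follows. This is only an illustration: the sample size, the mean vector, the common correlation of 0.5, the range of the uniform distribution and the seed are all assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only

n, d = 1000, 3                   # assumed sample size, 3 covariates as in the text
mean = np.ones(d)                # assumed mean vector
# Unit variances and a common positive correlation of 0.5 (assumed value)
cov = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)

X = rng.multivariate_normal(mean, cov, size=n)  # Gaussian covariates
beta = rng.uniform(0, 1, size=d)                # regression parameter, drawn once then kept fixed
eps = rng.standard_normal(n)                    # standard Gaussian noise
y = X @ beta + eps                              # linear model Y = X beta + eps
```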

We introduce some missing values (here MCAR) in the data matrix using the function produce_NA given in the Python notebook How to generate missing values in Python?
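The function produce_NA is not reproduced here. As a rough stand-in, and not the author's implementation, MCAR values can be introduced by masking each entry independently with the same probability. The proportion p_miss and the placeholder data matrix are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_complete = rng.standard_normal((1000, 3))   # placeholder complete data matrix

p_miss = 0.4                                  # assumed proportion of missing entries
# MCAR: the mask is drawn independently of the values in X_complete
mask = rng.random(X_complete.shape) < p_miss
X_miss = X_complete.copy()
X_miss[mask] = np.nan

# Empirical proportion of missing entries, close to p_miss
print(np.isnan(X_miss).mean())
```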

Two-step strategy

We will consider two imputation methods: mean imputation and iterative imputation.

More details on these methods can be found in How to impute missing values in Python.

Note that Josse et al. study the classic tools of missing values in the context of supervised learning. More particularly, they give the following take-home messages:

To concatenate the missing indicator to X, we can use the argument add_indicator=True of SimpleImputer. Note that this concatenation is done after imputation and is only used for prediction.
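A minimal sketch of this option on a toy matrix (the data are made up for illustration). By default, scikit-learn only adds indicator columns for the features that contained missing values when the imputer was fitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the first column is complete, the other two each have one missing value
X_miss = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan],
                   [7.0, 8.0, 9.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_aug = imputer.fit_transform(X_miss)

# The 3 imputed columns come first, followed by one 0/1 indicator column
# for each feature that had missing values during fit (here, 2 of them)
print(X_aug.shape)
print(X_aug)
```

The learner then receives both the imputed values and the information about where they were imputed.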

One-step strategy

We compare these imputation methods to a learning algorithm that can make predictions by directly accounting for missing values: Missing Incorporated in Attribute (MIA).

This method does not attempt to impute missing data. Here, the step of duplicating features is internal to the tree-based learning algorithm.
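One common way to mimic this behaviour with a standard tree-based learner is to duplicate each feature and replace its missing values by a very large value in one copy and a very small value in the other. A split on either copy can then send the missing values to the left or to the right child. The helper below is hypothetical (its name and the constant big are assumptions) and is only a sketch of the idea, not necessarily the implementation used in this notebook.

```python
import numpy as np

def duplicate_features_for_mia(X, big=1e6):
    """Hypothetical helper: return [X with NaN -> +big, X with NaN -> -big]."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_plus = np.where(missing, big, X)     # missing values pushed to the far right
    X_minus = np.where(missing, -big, X)   # missing values pushed to the far left
    return np.hstack([X_plus, X_minus])

X_demo = np.array([[1.0, np.nan],
                   [np.nan, 2.0]])
X_dup = duplicate_features_for_mia(X_demo)
print(X_dup.shape)  # twice as many columns as X_demo
```

The constant big must lie outside the range of the observed values for the encoding to separate missing from observed entries.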

Pipeline

Let's evaluate the different strategies by combining our imputers with different machine learning algorithms. The pipeline will be:

  1. Imputation.
  2. Regression on the imputed dataset.

Here we decompose each step of the pipeline for clarity.

First, we can split the data into train and test datasets.

We can then choose a learning algorithm, for example random forests. We can use the class RandomForestRegressor from the sklearn.ensemble module of scikit-learn. Note that we cannot directly apply the learner since, at least in scikit-learn versions before 1.4, random forests cannot deal with missing values.

We fit the imputation model on the train dataset only, and then transform both the train and test datasets with the same fitted imputer.
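The split and imputation steps can be sketched as follows, here with mean imputation. The toy data, the 30% MCAR missingness and the 75/25 split are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy incomplete data (assumed sizes and coefficients)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(200)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.3] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_miss, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                       # column means are learned on the train set only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)     # test set filled with the *train* means
```

Fitting the imputer on the train set alone avoids leaking information from the test set into the imputation model.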

Finally, we can fit the learner.
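Once the individual steps are clear, they can also be chained. The sketch below uses make_pipeline from scikit-learn, which fits the imputer on the training data and reuses it at prediction time. The toy data and the hyperparameters of the random forest are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy incomplete data (assumed sizes and coefficients)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(200)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.3] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_miss, y, test_size=0.25, random_state=0
)

# Step 1 (imputation) and step 2 (regression) in a single estimator
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)            # imputer and learner are both fitted on the train set
y_pred = model.predict(X_test)         # the test set is imputed with the fitted imputer
mse = np.mean((y_test - y_pred) ** 2)  # test mean squared error
print(mse)
```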

Method selection on synthetic data

The function score_pred compares the strategies above on synthetic data in terms of prediction performance, using the chosen learning algorithms.

More precisely, the function takes as input a complete data matrix (X) and an output variable (y). Then, missing values are introduced in the complete data matrix with a given percentage of missing values (p) and a given missing-data mechanism (mecha). Each method is performed with a learning algorithm (learner). The methods are detailed below:

The introduction of the missing values is repeated several times (nbsim), which is the source of the variability in the results (and in the boxplots).

The arguments are the following.

It returns scores for each strategy.
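The code of score_pred is not reproduced here. The function below is a hypothetical, simplified stand-in (its name, the restriction to MCAR, the set of strategies, the fixed train/test split and the use of the mean squared error are all assumptions) that illustrates the repeated-simulation logic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def compare_strategies_sketch(X, y, learner, p=0.3, nbsim=5, seed=0):
    """Hypothetical stand-in for score_pred, restricted to MCAR missingness."""
    rng = np.random.default_rng(seed)
    strategies = {
        "mean": SimpleImputer(strategy="mean"),
        "mean+mask": SimpleImputer(strategy="mean", add_indicator=True),
        "iterative": IterativeImputer(random_state=0),
    }
    scores = {name: [] for name in strategies}
    for _ in range(nbsim):
        # A new MCAR pattern at each repetition: the only source of variability here
        X_miss = X.copy()
        X_miss[rng.random(X.shape) < p] = np.nan
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_miss, y, test_size=0.25, random_state=0
        )
        for name, imputer in strategies.items():
            model = make_pipeline(clone(imputer), clone(learner))
            model.fit(X_tr, y_tr)
            scores[name].append(mean_squared_error(y_te, model.predict(X_te)))
    return scores


# Usage on toy data (assumed sizes and coefficients)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(200)
scores = compare_strategies_sketch(X, y, LinearRegression(), nbsim=3)
for name, values in scores.items():
    print(name, np.round(values, 2))
```

Each list in the returned dictionary holds one test error per repetition, which is the kind of output that can be summarised with boxplots.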

We apply this function by introducing MCAR or MNAR values in X. MCAR means that the probability that an observation is missing is independent of the data. MNAR means that the missingness depends on the missing values themselves and potentially also on the observed values. To introduce missing values, we use the function produce_NA from How to generate missing values in Python.
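The difference between the two mechanisms can be illustrated on a single Gaussian variable. The self-masked logistic mechanism below is just one possible MNAR mechanism, chosen for illustration, and is not necessarily the one used by the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 1))   # one variable with true mean 0

# MCAR: every entry is removed with the same probability, whatever its value
mcar_mask = rng.random(X.shape) < 0.5

# MNAR (self-masked): the larger the value, the more likely it is to be missing
prob_missing = 1 / (1 + np.exp(-2 * X))   # assumed logistic form
mnar_mask = rng.random(X.shape) < prob_missing

X_mcar = np.where(mcar_mask, np.nan, X)
X_mnar = np.where(mnar_mask, np.nan, X)

# Under MCAR the observed mean stays close to 0; under this MNAR mechanism
# it is pulled downwards because large values are preferentially removed
print(np.nanmean(X_mcar), np.nanmean(X_mnar))
```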

Method selection on real data

The function plot_score_realdatasets can be used for real datasets containing missing values. The arguments are the following.

It returns boxplots of the scores for each method (Mean imputation, Iterative imputation, MIA). The variability comes from the split of the dataset into a train set and a test set, which is repeated several times.

Here, we study a real dataset that does not actually contain missing values, so we add some missing values (MCAR or MNAR) before applying the function plot_score_realdatasets. In this case, we can also compute the scores for the complete matrix, which are shown in the boxplots.