## FAQ

When analysing data with missing values, some questions come up regularly during classes and seminars. Here we list the most frequent ones, with some elements of an answer. If you have another question about handling missing values, feel free to contact us via the Contact form.


*How to handle missing values in the validation or test set? Is it better to impute test and training set simultaneously, or separately?*

For prediction tasks, the same imputation model has to be used for the training and test sets. However, this is not always possible when imputing with a black-box imputation function that does not let you specify a given imputation model. (*JJ*)

See this recent article for a discussion of this topic and this video of a keynote at the useR! 2019 conference on the same subject.
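As a minimal sketch of what "the same imputation model" can mean in practice, the example below estimates a mean imputer on the training set only and then reuses it on the test set. Mean imputation, the toy data and the helper names `fit_mean_imputer` and `apply_imputer` are assumptions chosen for illustration; any imputation method that separates a fitting step from an application step follows the same pattern.

```python
import math
from statistics import mean

def fit_mean_imputer(rows):
    """Learn one mean per column from the observed (non-NaN) training values."""
    n_cols = len(rows[0])
    return [
        mean(r[j] for r in rows if not math.isnan(r[j]))
        for j in range(n_cols)
    ]

def apply_imputer(rows, col_means):
    """Fill NaN cells with the column means learned beforehand."""
    return [
        [col_means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]

nan = float("nan")
train = [[1.0, 10.0], [2.0, nan], [3.0, 30.0], [nan, 20.0]]
test = [[nan, 40.0], [5.0, nan]]

col_means = fit_mean_imputer(train)             # estimated on the training data only
train_imputed = apply_imputer(train, col_means)
test_imputed = apply_imputer(test, col_means)   # same model reused on the test set
```

The test set never contributes to the estimated means, so the imputation applied at prediction time is exactly the one the predictive model saw during training.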

*What percentage of missingness is large? Can we impute 90% of missingness using multiple imputation given that some thousands of imputed datasets are generated?*

The percentage of missing data is one of the most frequent questions from users. We are often asked: *If I have 30% NA, is that too much? And 40%, etc.?*

It is not only the percentage of missing data that counts, but also the structure of the data. A simple example to understand this point is a data set with 100 variables that are all identical, so that the correlation between these variables is 1. Even with 80% missing data, many imputation techniques will be able to predict the missing values perfectly, and the variability associated with the prediction will therefore be zero. Conversely, a data set can carry very little structure, in which case even a very small percentage of missing data can completely destroy the links between the variables.
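The extreme case of identical variables can be checked with a small simulation. This is only a sketch under stated assumptions: the sizes, the random seed and the very simple imputation rule (copy any observed value from the same row) are chosen for illustration and stand in for the regression-type methods that would exploit the same structure.

```python
import math
import random

random.seed(0)
n_rows, n_cols, missing_rate = 50, 100, 0.8

# All 100 variables are copies of the same underlying variable z.
z = [random.gauss(0, 1) for _ in range(n_rows)]
full = [[z[i]] * n_cols for i in range(n_rows)]

# Remove about 80% of the cells, keeping at least one observed value per row.
data = []
for row in full:
    keep = random.randrange(n_cols)
    data.append([
        v if (j == keep or random.random() > missing_rate) else float("nan")
        for j, v in enumerate(row)
    ])

missing_share = sum(math.isnan(v) for row in data for v in row) / (n_rows * n_cols)

# Impute each missing cell from the observed cells of the same row.
imputed = []
for row in data:
    observed = [v for v in row if not math.isnan(v)]
    fill = observed[0]  # any observed value works: all columns are identical
    imputed.append([fill if math.isnan(v) else v for v in row])

# Largest absolute imputation error over the whole table.
max_error = max(
    abs(imputed[i][j] - full[i][j])
    for i in range(n_rows)
    for j in range(n_cols)
)
```

Here `max_error` is exactly zero although `missing_share` is close to 0.8: the percentage alone says nothing about how well the missing values can be recovered.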

Of course, we do not know *a priori* the structure of the data. This is why it is imperative, with missing data, to consider the notions of variability and confidence in the results. Multiple imputation, for example, reflects the variance of the prediction of the missing data. A first way to assess the impact of missing data is to use visualization tools to inspect the different imputed values. Then, of course, the width of the confidence intervals is a good indicator. (*JJ*)

*k-NN is another method for estimating/imputing missing values; do you think this method can be used for every kind of data?*

Imputing from a nearest neighbour is a sensible strategy. The difficulty here is not only the missing data but k-NN itself on high-dimensional data sets with heterogeneous variables (quantitative, categorical, etc.). You need an appropriate distance that takes the mixed nature of the data into account, and possibly a dimension reduction step before computing the distances. So for many data sets it is not straightforward to apply a k-NN algorithm for imputation. (*JJ*)
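To make the point about the distance concrete, here is a minimal sketch of k-NN imputation on mixed data. It assumes a Gower-type dissimilarity (range-scaled absolute difference for numeric variables, simple mismatch for categorical ones) computed only on the variables observed in both rows. The function names, the encoding of missing values as `None` and the toy data are hypothetical choices for this illustration, not a reference implementation.

```python
from collections import Counter

def gower_distance(a, b, ranges):
    """Average per-variable dissimilarity over variables observed in both rows."""
    total, count = 0.0, 0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            continue
        if ranges[j] is None:            # categorical: 0 if equal, 1 otherwise
            total += 0.0 if x == y else 1.0
        elif ranges[j] > 0:              # numeric: difference scaled by the range
            total += abs(x - y) / ranges[j]
        count += 1
    return total / count if count else float("inf")

def knn_impute(rows, numeric_cols, k=2):
    """Fill each missing cell from the k closest rows where that cell is observed."""
    n_cols = len(rows[0])
    ranges = []
    for j in range(n_cols):
        if j in numeric_cols:
            obs = [r[j] for r in rows if r[j] is not None]
            ranges.append(max(obs) - min(obs))
        else:
            ranges.append(None)
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            donors = sorted(
                (gower_distance(row, other, ranges), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )[:k]
            values = [rows[idx][j] for _, idx in donors]
            if j in numeric_cols:
                result[i][j] = sum(values) / len(values)             # mean of neighbours
            else:
                result[i][j] = Counter(values).most_common(1)[0][0]  # most frequent level
    return result

# Columns: age, income, occupation (None marks a missing value).
rows = [
    [25, 30000.0, "student"],
    [27, None, "student"],
    [52, 80000.0, "manager"],
    [55, 90000.0, None],
    [24, 28000.0, "student"],
    [50, 75000.0, "manager"],
]
completed = knn_impute(rows, numeric_cols={0, 1}, k=2)
```

Even in this tiny example, the result depends on how the numeric variables are scaled and how categorical mismatches are weighted, which is precisely why the choice of distance needs care on real data.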

*Are there tools that help you decide which imputation method to use, based on the structure of your data?*

It always depends on the objective. If we only want to impute, and therefore to predict the missing values as well as possible, we can always do cross-validation: add missing cells to the data, predict them with different techniques and select the method that gives the smallest prediction error. You can also be guided by theoretical arguments. I impute a lot of my data with dimension reduction techniques (low-rank approximation), because it is quite plausible that a lot of data can be well approximated by low-rank matrices. (*JJ*)

Here is an interesting reference on this topic: Udell, M. (2019). Big Data is Low Rank. SIAG/OPT Views and News
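The cross-validation idea described above can be sketched as follows. The toy data, the 20% masking rate and the two candidate methods (column mean versus a least-squares regression on another variable) are assumptions made for illustration; in practice you would plug in the imputation methods you actually want to compare and repeat the masking several times.

```python
import math
import random
from statistics import mean

random.seed(1)

# Toy data (assumption): y is almost a linear function of x.
x = [random.gauss(0, 1) for _ in range(200)]
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x]

# Step 1: hide 20% of the observed y values; their true values remain known.
hidden = set(random.sample(range(len(y)), k=40))
kept = [i for i in range(len(y)) if i not in hidden]

# Step 2: fit each candidate imputation method on the remaining cells only.
x_bar = mean(x[i] for i in kept)
y_bar = mean(y[i] for i in kept)
slope = (
    sum((x[i] - x_bar) * (y[i] - y_bar) for i in kept)
    / sum((x[i] - x_bar) ** 2 for i in kept)
)

def impute_mean(i):
    return y_bar                            # same value for every hidden cell

def impute_regression(i):
    return y_bar + slope * (x[i] - x_bar)   # uses the observed x of row i

methods = {"mean": impute_mean, "regression": impute_regression}

# Step 3: compare the predictions with the hidden truth and keep the best method.
def rmse(predict):
    return math.sqrt(mean((predict(i) - y[i]) ** 2 for i in hidden))

errors = {name: rmse(f) for name, f in methods.items()}
best = min(errors, key=errors.get)
```

On data with this much structure the regression-based method wins by a wide margin; on weakly structured data the ranking can be different, which is the reason for checking it empirically.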

*In business, "time is money". Do you think the benefit of imputation is high enough to justify the time-consuming effort of imputing, even in a one-time analysis?*

Yes, definitely! The consequences of not taking the missing data into account can become dramatic very quickly. Even leaving aside the underestimation of the variance, there can be a significant bias! For example, I am currently working on estimating the effect of a treatment, and if we do not take the missing data into account, we would conclude that the treatment kills when in fact it saves. (*JJ*)

*For non-specialists, is there any function inside any package that just takes a dataset as argument, and returns the dataset with the best imputations/deletions?*

The first R packages that allow you to compare several imputation methods, such as `missCompare`, are starting to appear. There are still a lot of things to fix because all methods have many default settings, etc. But on the R-miss-tastic platform we will try to put together some workflows that help the user to make this type of comparison easily. (*JJ*)

*When a lot of complete data is still available, would you always suggest imputation considering that (poor) imputation might bias the results?*

If we have good reason to believe that the data are missing completely at random (MCAR), then yes, with a lot of data, we can work on the complete cases because they form a sample from the joint distribution of the data. Otherwise, even if we have a lot of data, the complete cases are a sample that is not representative of the population. The classic example is missing income data: if rich or poor people do not disclose their income, there is clearly a selection bias in the complete cases (MNAR data). But even if it is the young or the elderly who do not give their income, and income and age are strongly related, we have the same problem of selection bias (MAR data). (*JJ*)
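The income example can be illustrated with a short simulation. Everything in it is an assumption made for the sketch: the toy population in which income increases with age, the missingness probabilities and the threshold values. It compares the complete-case mean of income with the mean over the full sample under the three mechanisms.

```python
import random
from statistics import mean

random.seed(42)
n = 10_000

# Toy population (assumption): income increases with age, plus noise.
age = [random.uniform(20, 80) for _ in range(n)]
income = [1000 + 50 * a + random.gauss(0, 500) for a in age]
true_mean = mean(income)

def complete_case_mean(prob_missing):
    """Mean income over the rows whose income is observed."""
    observed = [
        inc for a, inc in zip(age, income)
        if random.random() >= prob_missing(a, inc)
    ]
    return mean(observed)

# Probability that income is missing, as a function of (age, income).
mechanisms = {
    "MCAR": lambda a, inc: 0.5,                          # unrelated to the data
    "MAR": lambda a, inc: 0.9 if a > 50 else 0.1,        # depends on observed age
    "MNAR": lambda a, inc: 0.9 if inc > 3500 else 0.1,   # depends on income itself
}

# Bias of the complete-case mean relative to the full-sample mean.
bias = {
    name: complete_case_mean(p) - true_mean
    for name, p in mechanisms.items()
}
```

Under MCAR the bias stays close to zero, whereas under both MAR and MNAR the complete cases over-represent low incomes and the mean is clearly underestimated, however large `n` is.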

*If the missingness is informative, what to do if the fact that the variable is missing is more predictive of the outcome than the unobserved value?*

If the presence of missing data is informative for prediction, adding an indicator to your dataset that codes for *missing*/*not missing* will help, because it is then treated as an explanatory variable. The MIA method (Twala et al. 2008) for regression trees/random forests allows this to be done. (*JJ*)

See also Josse et al. (2019).
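As a rough sketch of the indicator idea (not of MIA itself, which handles missing values inside the tree splits), the function below appends one *missing*/*not missing* column per variable and replaces the missing cells by a constant so that a standard learner can run. The function name `add_missing_indicators` and the fill value are hypothetical choices for this example.

```python
import math

def add_missing_indicators(rows, fill_value=0.0):
    """Return rows with NaN replaced by fill_value, followed by 0/1 missingness flags."""
    augmented = []
    for row in rows:
        filled = [fill_value if math.isnan(v) else v for v in row]
        flags = [1.0 if math.isnan(v) else 0.0 for v in row]
        augmented.append(filled + flags)
    return augmented

nan = float("nan")
X = [[1.5, nan], [nan, 2.0], [0.3, 0.7]]
X_aug = add_missing_indicators(X)
# X_aug has 4 columns: the 2 filled variables, then their 2 missingness flags.
```

A learner trained on `X_aug` can use the flag columns directly as explanatory variables, which is what matters when the missingness carries more signal than the unobserved value.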

*What do you suggest doing if you suspect that data is actually missing not at random? Are there any available options or can't we run any analysis?*

Yes, there are solutions that consist in modelling the missing-data mechanism; this often requires a fairly strong prior on the parametric form of the distribution of the missing data. But the practical solutions are still quite limited. There is also a series of new approaches based on graphical and causal models that can address MNAR data without modelling the mechanism. They offer new solutions, but these are still limited to simple models such as the linear model. See for instance Mohan and Pearl (2019). (*JJ*)

*Do we have widely accepted combination rules after multiple imputation for p-values?*

I would tend to say no, but that remains to be checked. What is certain is that Rubin's aggregation rules are not suitable for many quantities and that there is still a lot of research to be done on the subject. (*JJ*)
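For reference, Rubin's rules for a scalar estimate combine the point estimates and their variances obtained on the imputed datasets as sketched below. The function name `pool_rubin` and the numbers are illustrative assumptions. The rules rely on the estimate being approximately normally distributed, which is one reason they do not transfer directly to quantities such as p-values.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)         # pooled point estimate
    w = mean(within_variances)      # average within-imputation variance
    b = variance(estimates)         # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b         # total variance of the pooled estimate
    return q_bar, t

# Toy example: one coefficient estimated on m = 3 imputed datasets.
estimates = [1.0, 2.0, 3.0]
within_variances = [0.5, 0.5, 0.5]
q_bar, total_var = pool_rubin(estimates, within_variances)
```

The between-imputation term `b` is what carries the extra uncertainty due to the missing values; ignoring it amounts to treating the imputed values as if they had been observed.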

*How to avoid missing values in hierarchical features (for instance a series of interdependent questions in a survey)?*

You simply need to create a single variable with different categories, encoding the different series of possible answers.

For example,

*(1) Do you have a bank account? Yes/No*

*(2) If yes to (1): How many bank accounts do you have, <5 or >5?*

*(3) If >5: what is the total value? If <5, what is the value of account 1 to 5?*

will be coded in **one** variable with the following categories: *Yes >5_1*, *Yes >5_2*, *Yes >5_3*, *Yes >5_4*, *Yes >5_5*, *Yes <5* and *No*. (*JJ*)

*If you have a learner powerful enough to recognise encoded missing values, shouldn't it be able to recognise NA without resorting to recoding?*

(Question relating to Josse et al. 2019.)

Yes, that's the point. We do a recoding just because the implementations of most methods stop when they see the `NA` symbol for missing. They don't take it as a code. (*JJ*)