R-miss-tastic

A resource website on missing values - Methods and references for managing missing data

Below you will find a selection of high-quality lectures, tutorials and labs on different aspects of missing values. Note that some of these lectures come with publicly available video recordings.


This course provides an overview of modern statistical frameworks and methods for analysis in the presence of missing data. Both methodological developments and applications are emphasized. The course provides a foundation in the fundamentals of this area that will prepare students to read the current literature and to gain a broad appreciation of the implications of missing data for valid inference. Course page.


The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also raise important methodological and technical challenges for data analysts. This tutorial gives an overview of the missing-values literature as well as of recent improvements that have caught the attention of the community for their ability to handle large matrices with a large number of missing entries. The methods presented in this tutorial are illustrated on medical, environmental and survey data.
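As one concrete instance of the low-rank methods such material covers, the sketch below completes a matrix with missing entries using the softImpute package. The package choice, the simulated data and all settings are illustrative assumptions, not taken from the tutorial itself.

```r
# Minimal sketch: low-rank matrix completion with softImpute
# (an assumed example, not code from the tutorial).
library(softImpute)

set.seed(1)
# Simulate a 200 x 30 matrix of rank at most 10, then hide 30% of the entries
X <- matrix(rnorm(200 * 10), 200, 10) %*% matrix(rnorm(10 * 30), 10, 30)
X[sample(length(X), 0.3 * length(X))] <- NA

fit <- softImpute(X, rank.max = 10, lambda = 1)  # soft-thresholded SVD fit
X_completed <- complete(X, fit)                  # fill in the missing entries
```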


Missing values are ubiquitous in the practice of data analysis. In this series of lectures, we will start by presenting classical methods for handling missing data (simple imputation, multiple imputation, likelihood-based methods) developed in an inferential framework, where the objective is to best estimate parameters and their variance in the presence of missing data. We will emphasize very powerful methods of simple and multiple imputation based on low-rank approximations that can be applied to heterogeneous (quantitative and categorical) data.

We will then present recent results in a supervised learning framework. A striking result is that naive imputation strategies (such as mean imputation) can be optimal, as the supervised learning method does the hard work. The fact that such a simple approach can be relevant may have important consequences in practice. We will also discuss how missing-value modeling can be easily incorporated into tree models, such as gradient-boosted trees, resulting in a learner that has been shown to perform very well, including in challenging non-random missingness settings; both strategies are sketched below. Notebooks will be presented. Finally, we will briefly present how such results are useful in the context of causal inference with missing values in the covariates.
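As a hedged illustration of the two supervised-learning strategies just mentioned (not code from the lectures), the sketch below mean-imputes before fitting a generic learner, then fits a gradient-boosted tree model that routes missing values natively in its splits. The simulated data and tuning choices are made up for the example.

```r
# Strategy 1: mean imputation + any learner.  Strategy 2: trees with native NA handling.
library(xgboost)

set.seed(1)
n <- 1000
X <- matrix(rnorm(n * 5), n, 5)
y <- X[, 1] + X[, 2]^2 + rnorm(n)
X[sample(length(X), 0.2 * length(X))] <- NA   # 20% of values missing completely at random

# Strategy 1: impute each column by its mean, then fit a supervised model
X_mean <- apply(X, 2, function(col) { col[is.na(col)] <- mean(col, na.rm = TRUE); col })
fit_lm <- lm(y ~ ., data = data.frame(X_mean, y = y))

# Strategy 2: let the gradient-boosted trees handle the NAs directly
fit_xgb <- xgboost(data = X, label = y, nrounds = 100, verbose = 0)
```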


This course focuses on the theory and methods of missing-data analysis. Topics include maximum likelihood estimation under missing data, the EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approaches, propensity scores, semi-parametric approaches, and non-ignorable missing data.
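To make the EM idea concrete, here is a minimal textbook-style sketch (not course material) for a bivariate normal sample whose second coordinate is partly missing, assuming ignorable missingness.

```r
# EM for a bivariate normal (y1, y2) with y2 partly missing (assumed MAR).
em_bivnorm <- function(y1, y2, n_iter = 50) {
  obs <- !is.na(y2)
  # initialize from complete cases
  mu <- c(mean(y1), mean(y2[obs]))
  S  <- cov(cbind(y1, y2)[obs, ])
  for (it in seq_len(n_iter)) {
    # E-step: conditional mean and second moment of missing y2 given y1
    beta <- S[1, 2] / S[1, 1]
    e2   <- ifelse(obs, y2, mu[2] + beta * (y1 - mu[1]))
    e22  <- ifelse(obs, y2^2, e2^2 + S[2, 2] - beta * S[1, 2])
    # M-step: update parameters from the completed sufficient statistics
    mu <- c(mean(y1), mean(e2))
    S[1, 1] <- mean(y1^2) - mu[1]^2
    S[2, 2] <- mean(e22)  - mu[2]^2
    S[1, 2] <- S[2, 1] <- mean(y1 * e2) - mu[1] * mu[2]
  }
  list(mu = mu, Sigma = S)
}
```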


This course formally introduces methodologies for handling missing data in statistical analyses. It covers naive methods, missing-data assumptions, likelihood-based approaches, Bayesian and multiple imputation approaches, inverse-probability weighting, pattern-mixture models, sensitivity analysis and approaches under nonignorable missingness. Computational tools such as the Expectation-Maximization algorithm and the Gibbs sampler will be introduced. This course is intended for students who are interested in methodological research.
Course syllabus

Exercises/Homework

Statistical modeling and missing data (video)

(Rod Little, keynote talk at the virtual workshop on Missing Data Challenges in Computational Statistics and Applications, Fall 2020)



This course is the second part of a NIHES course on Missing Values in Clinical Research and focuses on multiple imputation (MI), specifically the fully conditional specification (FCS, MICE), which is often considered the gold standard for handling missing data. It gives a detailed discussion of what MI(CE) does, which assumptions need to be met for it to perform well, and which alternative imputation approaches suit settings where MICE is not optimal. The theoretical considerations are accompanied by demonstrations and short practical sessions in R, and a workflow for doing MI with the R package mice is proposed, illustrating how to perform (multiple) imputation for cross-sectional and longitudinal data in R.
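For readers who want a taste before taking the course, a minimal mice workflow might look like the sketch below, using the nhanes toy data shipped with mice; the course's own examples and settings will differ.

```r
# Minimal MICE workflow: impute, analyze each completed data set, pool.
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # FCS imputation, 5 completed data sets
fit <- with(imp, lm(bmi ~ age + chl))                    # fit the analysis model in each
summary(pool(fit))                                       # pool the results with Rubin's rules
```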


This short course on multiple imputation gives an overview of missing-data problems and of various solutions to tackle them, as well as their limitations. It introduces MI inference and provides details on the implementation and application of MI.



This tutorial is part of a master's course on statistics with R. It discusses different missing-values problems and illustrates them on medical, industrial and ecological data. It provides a detailed introduction to single and multiple imputation via principal component methods, both in theory and in practice. The practical part illustrates how to perform (multiple) imputation using the R package missMDA.


These two videos can be viewed independently or as a complement to the above tutorial on imputation using principal components, as they provide detailed explanations of how to use the functions of the missMDA package to visualize and analyze missing values and how to perform (multiple) imputation. A short sketch of this workflow follows below.
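A minimal sketch of the missMDA workflow these resources demonstrate, on the orange data shipped with the package (the number of bootstrap imputations below is an illustrative assumption):

```r
# Single and multiple imputation via principal component methods with missMDA.
library(missMDA)

data(orange)
nb  <- estim_ncpPCA(orange)                     # choose the number of PCA dimensions
imp <- imputePCA(orange, ncp = nb$ncp)          # single imputation; imp$completeObs is the filled data
mi  <- MIPCA(orange, ncp = nb$ncp, nboot = 50)  # multiple imputation
plot(mi)                                        # visualize the uncertainty due to missing values
```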



This keynote talk gives an overview of different approaches for inference and prediction tasks. A striking result for the latter is that the widely used method of imputing with the mean prior to learning can be consistent.


This course recalls basic concepts of surveys and data collection before discussing how to handle unit non-response and item non-response in surveys.


In follow-up studies, different types of outcomes are typically collected for each subject. These include longitudinally measured responses (e.g., biomarkers) and the time until an event of interest occurs (e.g., death, dropout). Often these outcomes are analyzed separately, but on many occasions it is of scientific interest to study their association. This type of research question has given rise to the class of joint models for longitudinal and time-to-event data. These models constitute an attractive paradigm for the analysis of follow-up data that is mainly applicable in two settings: first, when focus is on a survival outcome and we wish to account for the effect of endogenous time-dependent covariates measured with error, and second, when focus is on the longitudinal outcome and we wish to correct for non-random dropout.

This course is aimed at applied researchers and graduate students, and provides a comprehensive introduction to this modeling framework. It explains when these models should be used in practice, what the key assumptions behind them are, and how they can be utilized to extract relevant information from the data. Emphasis is given to applications; by the end of the course, participants will be able to define appropriate joint models to answer their questions of interest.


This tutorial gives a short overview of methods for missing data in time series in R in general, and subsequently introduces the imputeTS package. The imputeTS package is made specifically for handling missing data in time series and offers several functions for the visualization and replacement (imputation) of missing data. Usage examples show how imputeTS can be applied for time series imputation.
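As a flavor of the package's API, here is a minimal sketch on the tsAirgap example series shipped with imputeTS (the choice of imputation algorithm is an assumption; the package offers several others):

```r
# Visualize, impute and inspect missing values in a time series with imputeTS.
library(imputeTS)

ggplot_na_distribution(tsAirgap)      # visualize where values are missing
imp <- na_kalman(tsAirgap)            # impute via Kalman smoothing on a state-space model
ggplot_na_imputations(tsAirgap, imp)  # compare imputed values with the observed series
```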


While the problem of missing values in the covariates was considered very early in the causal inference literature, it remains difficult for practitioners to know which method to use, under which assumptions the different approaches are valid, and whether the tools developed are also adapted to more complex data, e.g., high-dimensional or mixed data. This talk provides a rigorous classification of existing methods according to their main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or on assumptions about the mechanism that generates the missing values. It also highlights two recent contributions on this topic: first, an extension of classical doubly robust estimators that allows handling of missing attributes, and second, an approach to causal inference based on variational autoencoders in the case of latent confounding.


The estimation of count data, such as bird abundance, is an important task in many disciplines and can be used, for instance, by ecologists for species conservation. Collecting count data is often subject to inaccuracies and missing data, due to the nature of the counted object and the multiplicity of actors/sensors collecting the data over more or less long periods of time. Methods such as correspondence analysis or generalized linear models can be used to estimate these missing values and allow more accurate analyses of the count data. The objective of this project is to investigate the abundance of the Eurasian Coot, which is mainly observed in the Mediterranean part of North Africa, and its relation to external geographical and meteorological factors. First, different methods are compared in terms of accuracy, using generalized linear models (glm) and the R packages rtrim, lori and missMDA. Afterwards, external factors and their impact on bird abundance are examined, and finally the temporal trend is investigated to determine whether the Eurasian Coot is declining or not.
This project was carried out in collaboration with the Research Institute for the Conservation of Mediterranean Wetlands, the association Les Amis des Oiseaux (Friends of the Birds) and the Office National de la Chasse et de la Faune Sauvage (National Agency for Hunting and Wildlife).



If you wish to contribute some of your own material to this platform, please feel free to contact us.

