This is a research experiment using the `mice` package (Buuren and Groothuis-Oudshoorn, 2011) to impute missing values in multivariate data. `mice` implements the Fully Conditional Specification (FCS) method; "chained equations" is considered an alias of that term. According to Liu and De (2015), FCS is more relaxed than the conventional Joint Modeling (JM) method in terms of specifying the distribution of the missing data.
In this experiment, we will use the open source Titanic dataset (Kaggle, 2012) as the sample data.
1. Multiple imputation vs. single imputation
Both multiple and single imputation are techniques for replacing missing values in the data with plausible values. In single imputation, we substitute each missing value with a single value; for example, we can replace the missing values of a variable with its most common value, or with its mean/median (if the variable is continuous). Multiple imputation is different in the sense that each missing value is replaced by multiple plausible values. Using the `mice` package, we can even obtain multiple versions of the imputed data by imputing the data more than once, as sketched below.
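To make the distinction concrete, here is a minimal sketch in R, assuming a data frame `df` with a numeric `Age` column containing missing values (the names here are illustrative, not from the experiment):

```r
library(mice)

# Single imputation: every missing Age gets the same value (the observed mean).
df_single <- df
df_single$Age[is.na(df_single$Age)] <- mean(df$Age, na.rm = TRUE)

# Multiple imputation: mice() draws several plausible values per missing entry,
# producing m completed versions of the data instead of one.
imp <- mice(df, m = 5, seed = 1)
df_multi_1 <- complete(imp, 1)  # the first of the m completed versions
```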
2. Notation
- $Y_j$ $(j = 1, \dots, p)$: one of $p$ incomplete variables.
- $Y_j^{obs}$ and $Y_j^{mis}$: the observed and missing parts of $Y_j$, respectively.
- $Y^{obs} = (Y_1^{obs}, \dots, Y_p^{obs})$: the observed data in $Y$.
- $Y^{mis} = (Y_1^{mis}, \dots, Y_p^{mis})$: the missing data in $Y$.
- $m \ge 1$: the number of imputations.
- $Q$: the model of interest that will be estimated on $Y$.
3. Features
The `mice` package introduces a modular approach to imputing data. The imputation framework offered by `mice` comprises three steps: imputation, analysis, and pooling (Figure 1).
- `mice()`: imputes the incomplete data, creating imputed versions of the data $Y^{(1)}, Y^{(2)}, \dots, Y^{(m)}$.
- `with()`: estimates the model $Q$ on each imputed dataset $Y^{(i)}$, creating multiple versions of the model, $\hat{Q}^{(1)}, \hat{Q}^{(2)}, \dots, \hat{Q}^{(m)}$.
- `pool()`: pools $\hat{Q}^{(1)}, \hat{Q}^{(2)}, \dots, \hat{Q}^{(m)}$ into a single estimate $\bar{Q}$.
A notable feature of this imputation framework is that we analyse the imputed data ourselves, with our own model of interest $Q$. By doing that, we can specifically select the imputed data that best suits our investigation.
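As a quick illustration, here is a minimal sketch of the three-step workflow, assuming a data frame `df` with an outcome `y` and predictors `x1` and `x2` (hypothetical names used only for this example):

```r
library(mice)

imp <- mice(df, m = 5, seed = 1)    # step 1: impute, producing Y^(1), ..., Y^(m)
fit <- with(imp, lm(y ~ x1 + x2))   # step 2: estimate Q on each imputed dataset
est <- pool(fit)                    # step 3: pool into the single estimate Q-bar
summary(est)
```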
4. Experiment
4.1. Missing value inspection
First and foremost, load the package and the data. In this experiment, we will use the Titanic data, freely available from Kaggle.
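A minimal loading sketch, assuming the Kaggle training file `train.csv` sits in the working directory; the `NA` handling and the variable subset are our assumptions:

```r
library(mice)

# Empty strings (e.g., in Embarked) are read as NA -- an assumption about the raw file.
titanic <- read.csv("train.csv", na.strings = c("", "NA"), stringsAsFactors = TRUE)

# Keep a manageable subset of variables for this experiment (our choice).
titanic <- titanic[, c("Survived", "Pclass", "Sex", "Age", "Fare", "Embarked")]
```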
Our data contains a lot of missing values, and we will use different methods to inspect them. First, we will use the `md.pattern` function to inspect the missing-data patterns.
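For example, assuming the data frame `titanic` from above:

```r
# Tabulate the missing-data patterns: one row per pattern,
# with 1 = observed and 0 = missing.
md.pattern(titanic)
```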
Let’s interpret the information in the retrieved table. Note that in the table, 0 indicates a missing value. Reading by row, the table tells us that we have 705 complete rows, 169 rows where only `Age` is missing, 7 rows where only `Fare` is missing, 2 rows where only `Embarked` is missing, and 8 rows where both `Fare` and `Age` are missing. Reading vertically, the table tells us that `Embarked` has 2 missing values, `Fare` 15, and `Age` 177. Overall, we have 194 missing values in the data, and 169 + 7 + 2 + 8 = 186 incomplete rows.
Besides `md.pattern`, we can visually inspect the missing-data patterns using the package `VIM`. First, we will use `aggr` to aggregate the missing values.
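A minimal call might look as follows (the plotting options are our assumed choices):

```r
library(VIM)

# Plot, for each variable, the proportion of missing values,
# together with the combinations in which variables are missing.
aggr(titanic, numbers = TRUE, prop = TRUE)
```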
The function `aggr` is essentially similar to `md.pattern`, except that `aggr` displays missing values as proportions. For example, roughly 20% (177 rows) of `Age` is missing, 1.6% (15 rows) of `Fare` is missing, and 0.2% (2 rows) of `Embarked` is missing. In roughly 0.9% (8 rows) of the data, both `Age` and `Fare` are missing. Thus, it is convenient to combine `md.pattern` and `aggr` when inspecting the missing values of the data.
We can also observe missing values by pair. The function `marginplot` is perfect for this purpose. We will use `marginplot` to inspect the missing patterns of the pair `Age` and `Fare`.
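For example:

```r
# Scatterplot of Age against Fare; the red points in the margins mark
# rows where the other variable of the pair is missing.
marginplot(titanic[, c("Age", "Fare")])
```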
Let’s interpret the information given by the plot.
- Blue points in the main area indicate rows where both `Age` and `Fare` are complete. The red points in the left margin indicate missing values of `Fare` where `Age` is observed. Similarly, the red points in the bottom margin indicate missing values of `Age` where `Fare` is observed.
- The three numbers tell us that there are 15 missing values in `Fare`, 177 missing values in `Age`, and 8 rows where both are missing.
- The blue and red boxplots summarise the marginal distributions of the complete and incomplete rows. For example, the complete `Age` values range from 0 to 80 years, whereas the `Age` values of rows with a missing `Fare` range from roughly 20 to 50 years (i.e., the 15 passengers with missing `Fare` values are between 20 and 50 years old).
4.2. Data imputation with `mice()`
We can simply impute the data by using the function `mice()`. For example:
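A minimal sketch (the seed is our assumed choice, added for reproducibility):

```r
# Impute with the defaults: m = 5 imputed datasets, maxit = 5 iterations each.
imp <- mice(titanic, seed = 1)

# Extract the first completed dataset.
completed <- complete(imp, 1)
```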
By default, `mice()` creates 5 versions of the imputed data (i.e., `m = 5`), and for each version the default number of iterations is also 5 (i.e., `maxit = 5`). According to the authors, the Gibbs sampler often converges when `maxit` is between 10 and 20.
We can see that missing values no longer exist in the data. Let’s see how the data was imputed across the 5 versions, beginning with `Embarked`.
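The candidate values for `Embarked` are stored in the `imp` component of the returned `mids` object:

```r
# One column per imputation; one row per originally missing entry.
imp$imp$Embarked
```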
So, `mice()` imputed the missing values in `Embarked` with different values across the versions. Note that in the original data there are 644 rows with the value `S`, so we suspect the true values are most likely (S, S).
Next, we will explore the distributions of the imputed values of `Age` and `Fare`. It can be seen from both plots that the distributions of the imputed values are similar to those of the observed values.
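One way to produce such plots is the `densityplot` method that `mice` provides for `mids` objects:

```r
# Observed values are drawn in blue, imputed values in red,
# with one curve per imputed dataset.
densityplot(imp, ~ Age + Fare)
```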
4.3. Data analysis with `with()`
After imputation, we analyse the imputed data using `with()`. As noted above, we analyse the imputed data with our own model of interest $Q$.
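As an illustration, a logistic regression of survival is one plausible choice of $Q$ for this data (the formula below is our assumption, not necessarily the experiment's model):

```r
# Fit Q on each of the m imputed datasets,
# producing the estimates Q-hat^(1), ..., Q-hat^(m).
fit <- with(imp, glm(Survived ~ Age + Fare + Embarked, family = binomial))
```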
4.4. Result pooling with `pool()`
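Finally, the $m$ analysis results are combined into a single set of estimates with `pool()`. A minimal sketch, continuing from the `fit` object above:

```r
# Pool the m sets of coefficients and standard errors (Rubin's rules)
# into the single estimate Q-bar.
est <- pool(fit)
summary(est)
```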
5. Convergence assessment
Convergence assessment is important to check whether the Gibbs sampling algorithm has converged. According to the paper: “There is no clear-cut method for determining whether the Gibbs sampling algorithm has converged. What is often done is to plot one or more parameters against the iteration number”, and “on convergence, the different streams should be freely intermingled with each other, without showing any definite trends. Convergence is diagnosed when the variance between different sequences is no larger than the variance within each individual sequence”.
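In `mice`, these trace plots are available via the `plot` method for `mids` objects:

```r
# Plot the mean and standard deviation of the imputed values for each
# variable against the iteration number, one stream per imputation.
plot(imp)
```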
6. Conclusion
In this experiment, we have demonstrated how missing values in data can be imputed using the FCS technique. Compared with the JM technique, FCS (as implemented in MICE) is considered more adaptable because we can select an appropriate imputation model for each variable on a case-by-case basis. For example, we can choose linear regression or Bayesian linear regression for continuous variables, and logistic regression or linear discriminant analysis for nominal variables.
The `mice` package in R is a powerful and convenient library that enables multivariate imputation through a modular approach consisting of three consecutive steps. First, we impute the missing values using a single `mice()` call, then analyse the imputed versions of the data using the `with()` method with our own model of choice, and finally report the pooled result using the `pool()` method.
After this experiment, we believe the `mice` package is capable of supporting our future research, where we will have the chance to explore additional features, such as implementing our own imputation models.
References
Buuren, S. & Groothuis-Oudshoorn, K. (2011, December). MICE: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. Retrieved from http://doc.utwente.nl/78938/
Kaggle. (2012, September). Titanic: machine learning from disaster. Retrieved from https://www.kaggle.com/c/titanic
Liu, Y. & De, A. (2015, July). Multiple Imputation by Fully Conditional Specification for Dealing with Missing Data in a Large Epidemiologic Study. International Journal of Statistics in Medical Research, 4(3), 287–295. doi:10.6000/1929-6029.2015.04.03.7