Chapter 2 Theoretical Background

2.1 A brief introduction to missing data and multiple imputation

Missing data pose a threat to the validity of statistical inferences when they are (a) numerous, (b) not missing completely at random in the sense of Rubin (1976), and (c) handled in an inadequate way, for example by resorting to simple ad-hoc methods such as unconditional mean imputation (Schafer, 1997a). Multiple imputation (MI) is a state-of-the-art method to handle missing data (Schafer & Graham, 2002).

2.1.1 Multiple imputation in a nutshell

Since Rubin’s book (Rubin, 1987) laid out the theoretical foundation and Schafer’s software norm (MI based on the multivariate normal model; Schafer, 1997a) made the approach widely accessible, MI has steadily gained popularity and is now one of the standard procedures to handle missing data (Schafer & Graham, 2002). Today, different frameworks, methods, and models are available to create multiple imputations in many scenarios and contexts.

Multiple imputation procedures are generally preferable to single imputation, since single imputation methods typically do not produce unbiased standard error estimates unless some correction procedure is applied.

Multiple imputation following Rubin’s theory (Rubin, 1987) is a three-step process: In the first step, each missing value is filled in \(m\) times, based on an appropriate prediction model for the incomplete data given the observed data. In the second step, the \(m\) completed data sets are analysed separately by any complete-data technique. In the third step, the \(m\) sets of model parameters and standard errors are combined into an overall set of results using the formulas given in Rubin (1987) or Barnard & Rubin (1999). These combining rules integrate two variance components: the variance within and the variance between the imputed data sets. The between-imputation variance reflects the additional estimation uncertainty due to missing data.
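The three steps can be illustrated with a few lines of mice code (a minimal sketch using the nhanes example data shipped with mice; the analysis model is chosen arbitrarily):

library(mice)
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # step 1: impute m times
fit <- with(imp, lm(bmi ~ age))                            # step 2: analyse each completed data set
pool(fit)                                                  # step 3: combine results via the combining rules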

Most MI methods today assume that the data are missing at random (MAR) in the sense of Rubin (1976), meaning that missingness can be predicted from observed information in the data set and does not depend on unobserved information. To make the MAR assumption more plausible, it is advisable to identify, and include in the imputation model, variables that are related to both the incomplete variables and their missingness indicators (see for example Collins, Schafer, & Kam, 2001; van Buuren & Groothuis-Oudshoorn, 2011).

2.1.2 Multiple imputation frameworks

The two common frameworks for creating multiple imputations are joint modeling (JM; e.g. Schafer, 1997a, 1997b) and conditional modeling (CM), also known as sequential regression multiple imputation, fully conditional specification (FCS), or multiple imputation by chained equations (mice; van Buuren, 2012). In JM, a joint imputation model is specified for the data set as a whole, whereas in CM, a separate conditional imputation model is specified for each incompletely observed variable in the data file. The mice software is based on the conditional modeling approach. The idea of CM is to use conditional distributions to model an underlying joint distribution. It was originally developed to provide an alternative “general purpose multivariate imputation procedure that can handle a relatively complex data structure where explicit full multivariate models cannot be easily formulated” (Raghunathan, Lepkowski, Van Hoewyk, & Solenberger, 2001, p. 86), for example when the data set consists of variables of many different types (e.g. continuous, categorical, count).

2.1.3 The mice package

mice is an excellent R package to create multiple imputations (van Buuren & Groothuis-Oudshoorn, 2011). countimp extends the functionality of mice by providing additional imputation functions for various kinds of parametric count models. mice has several advantages we want to point out:

  • mice already comes with a great variety of imputation methods and models.
  • mice allows users to supply their own imputation functions, so its functionality can be easily extended (see the sketch after this list).
  • mice offers highly helpful tools to assess convergence of the imputation algorithm and to assess the plausibility of the obtained imputations. For an overview, see van Buuren & Groothuis-Oudshoorn (2011) or van Buuren (2012).
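As a hedged illustration of the second point, a user-written method is simply an R function that follows the mice.impute.name naming convention and returns imputations for the missing entries (the arguments y, ry, and x are explained below; the hypothetical method "meannoise" shown here is purely didactic, not a recommended imputation procedure):

mice.impute.meannoise <- function(y, ry, x, ...) {
  # observed mean plus normal residual noise (for illustration only)
  mean(y[ry]) + rnorm(sum(!ry), mean = 0, sd = sd(y[ry]))
}

Specifying "meannoise" in the method argument of mice() would then call this function.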

Here, we focus only on the essentials needed to get novice users started. Imputations are created by the main mice() function, which has the following arguments:

library(mice)
args("mice")
## function (data, m = 5, method = vector("character", length = ncol(data)), 
##     predictorMatrix = (1 - diag(1, ncol(data))), where = is.na(data), 
##     visitSequence = NULL, form = vector("character", length = ncol(data)), 
##     post = vector("character", length = ncol(data)), defaultMethod = c("pmm", 
##         "logreg", "polyreg", "polr"), maxit = 5, diagnostics = TRUE, 
##     printFlag = TRUE, seed = NA, imputationMethod = NULL, defaultImputationMethod = NULL, 
##     data.init = NULL, ...) 
## NULL

The argument m sets the number of imputations (default m = 5); maxit sets the number of iterations of the Gibbs sampler (default maxit = 5).
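For example, the following call requests ten imputations with twenty iterations each (again using the nhanes example data shipped with mice; the settings are arbitrary):

imp <- mice(nhanes, m = 10, maxit = 20, seed = 123)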

Imputation methods are specified via the method argument. A non-exhaustive overview of imputation methods that are already implemented in mice is given in Table 2.1.

Table 2.1: An overview of imputation methods in mice

Name      Description                         Scale
pmm       predictive mean matching            numeric
norm      Bayesian linear regression          numeric
2l.norm   2-level linear mixed effects model  numeric
logreg    Bayesian logistic regression        factor (2 levels)
polyreg   polytomous regression               factor (\(>2\) levels)
sample    random sample from observed data    any

The procedures are all described in detail in van Buuren & Groothuis-Oudshoorn (2011) and van Buuren (2012). Argument method must be a character vector whose length equals the number of variables in the data set, indicating the missing-data procedure with which each variable is to be imputed. The first slot of method sets the imputation function for the first variable in the data file, and so on. If method is not specified, the defaultMethod is used: pmm (predictive mean matching) for continuous data, logreg (Bayesian logistic regression) for factors with two levels, polyreg (polytomous logistic regression) for unordered factors with more than two levels, and polr (the proportional odds model) for ordered factors. Completely observed variables receive method "", indicating that they need not be imputed. The respective imputation functions are called mice.impute.name, where name identifies the respective imputation function. For example, specifying "logreg" as the imputation method for a variable calls the function mice.impute.logreg().

library(mice)
mice.impute.logreg
## function (y, ry, x, wy = NULL, ...) 
## {
##     if (is.null(wy)) 
##         wy <- !ry
##     aug <- augment(y, ry, x, wy)
##     x <- aug$x
##     y <- aug$y
##     ry <- aug$ry
##     wy <- aug$wy
##     w <- aug$w
##     x <- cbind(1, as.matrix(x))
##     expr <- expression(glm.fit(x = x[ry, , drop = FALSE], y = y[ry], 
##         family = binomial(link = logit), weights = w[ry]))
##     fit <- suppressWarnings(eval(expr))
##     fit.sum <- summary.glm(fit)
##     beta <- coef(fit)
##     rv <- t(chol(sym(fit.sum$cov.unscaled)))
##     beta.star <- beta + rv %*% rnorm(ncol(rv))
##     p <- 1/(1 + exp(-(x[wy, , drop = FALSE] %*% beta.star)))
##     vec <- (runif(nrow(p)) <= p)
##     vec[vec] <- 1
##     if (is.factor(y)) {
##         vec <- factor(vec, c(0, 1), levels(y))
##     }
##     return(vec)
## }
## <environment: namespace:mice>

For details about the respective arguments, see help("mice.impute.logreg").
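To illustrate the method argument, a specification for the nhanes2 example data shipped with mice could look as follows (age is completely observed; bmi and chl are continuous; hyp is a binary factor; choosing norm instead of the default pmm for chl is arbitrary):

meth <- c("", "pmm", "logreg", "norm")  # age, bmi, hyp, chl (column order)
imp <- mice(nhanes2, method = meth, printFlag = FALSE, seed = 1)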

Imputation models are defined via the predictorMatrix argument. predictorMatrix must be a square matrix with dimensions equal to the number of variables in the data set. An example, generated with mice’s quickpred() function, is given below:

library(countimp)
data(crim4w)
quickpred(crim4w)
##        id FEMALE RE GY HA ACRIM BCRIM CCRIM DCRIM
## id      0      0  0  0  0     0     0     0     0
## FEMALE  0      0  0  0  0     0     0     0     0
## RE      0      0  0  0  0     0     0     0     0
## GY      0      0  0  0  0     0     0     0     0
## HA      0      0  0  0  0     0     0     0     0
## ACRIM   1      1  0  1  1     0     1     1     1
## BCRIM   1      1  0  1  1     1     0     1     1
## CCRIM   1      1  0  1  1     1     1     0     1
## DCRIM   1      1  0  1  1     1     1     1     0

Each row \(i\) of that matrix defines the imputation model for variable \(i\) in the data file. The zeros and ones indicate whether the variable in column \(j\) is used to predict missing data in variable \(i\) (0 = no; 1 = yes). Using the information from the predictorMatrix, mice automatically creates three objects that are passed on to the respective mice.impute.name sub-function: y, x, and ry. y is an incomplete data vector of length \(n\), the variable to be imputed. x is an \(n \times p\) matrix of predictors, i.e. those variables that were specified via the respective row of the predictorMatrix. ry is the response indicator of vector y, indicating whether a value in y has been observed (TRUE) or not (FALSE). For descriptions of further arguments, see help("mice").
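For illustration, the matrix returned by quickpred() can be edited before it is passed to mice(); here we remove the id variable from all imputation models (an assumed, though typically sensible, choice):

pred <- quickpred(crim4w)
pred[, "id"] <- 0                      # never use id as a predictor
imp <- mice(crim4w, predictorMatrix = pred, printFlag = FALSE, seed = 1)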

2.1.4 Bayesian regression and bootstrap regression

Regression-based imputation methods can be classified as either deterministic or stochastic, depending on whether or not there is some degree of randomness in the imputation process. A deterministic approach imputes exactly the value that the regression model predicts for the respective incomplete case. Such approaches usually cause an underestimation of standard errors, since all imputed values lie directly on the regression line.

Multiple imputation based on Rubin’s theory (Rubin, 1987) therefore includes a stochastic component that introduces an adequate amount of between-imputation variability and thus reflects the estimation uncertainty due to missing data.

Two common solutions in CM are Bayesian regression (see for example Rubin, 1987, pp. 169–170) and bootstrap regression: Bayesian regression first fits a model to the complete part of the data and computes the posterior mean \(\hat{\beta}\) and posterior variance \(V(\hat{\beta})\) of the model parameters \(\beta\). New parameters are then simulated from \(\mathcal{N}(\hat{\beta},V(\hat{\beta}))\) and used to make predictions for the incomplete part of the data and to obtain the imputations.

The bootstrap variant draws a bootstrap sample from the observed cases, fits the regression model to the bootstrap sample and uses the obtained model parameters to make predictions for the incomplete part of the data.
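The following minimal sketches show both variants for an incomplete Poisson count variable y with predictor matrix x; the function names and implementation details are illustrative assumptions, not countimp’s actual code:

impute.pois.bayes <- function(y, ry, x) {
  x <- cbind(1, as.matrix(x))                             # design matrix with intercept
  fit <- glm(y[ry] ~ x[ry, ] - 1, family = poisson)       # fit to the observed cases
  beta.star <- MASS::mvrnorm(1, coef(fit), vcov(fit))     # simulate new parameters
  rpois(sum(!ry), exp(x[!ry, , drop = FALSE] %*% beta.star))  # draw imputations
}

impute.pois.boot <- function(y, ry, x) {
  x <- cbind(1, as.matrix(x))
  idx <- sample(which(ry), sum(ry), replace = TRUE)       # bootstrap sample of observed cases
  fit <- glm(y[idx] ~ x[idx, ] - 1, family = poisson)     # refit on the bootstrap sample
  rpois(sum(!ry), exp(x[!ry, , drop = FALSE] %*% coef(fit)))  # draw imputations
}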

Imputation functions from package countimp are available as Bayesian regression and bootstrap regression variants.

2.2 A brief introduction to count data models

A good introduction and summary may be found in Zeileis et al. (2008). A very comprehensive text is Hilbe (2011). The standard procedure to analyse ordinary count data is to fit a Poisson model within the generalized linear modeling (GLM) framework. GLMs describe the dependence of a scalar response variable \(y_i\) on a set of regressors \(x_i\). The conditional distribution of \(y_i\vert x_i\) is a linear exponential family with probability density

\[\begin{equation} f(y;\lambda,\delta) = \exp\left( \frac{y\lambda-b(\lambda)}{\delta}+c(y,\delta) \right), \end{equation}\]

with \(\lambda\) depending on \(x_i\) via a linear predictor, \(\delta\) being a dispersion parameter, and \(b(\cdot)\) and \(c(\cdot)\) being functions that determine which member of the family (e.g., Poisson) is used.

The classical Poisson model

\[\begin{equation} f(y;\mu)=\frac{\exp(-\mu)\mu^y}{y!} \end{equation}\]

with link function \(g(\mu)=\log(\mu)\) assumes that the variance \(V(\mu)\) is equal to the mean \(\mu\). In Poisson regression, the probability of observing a certain count \(y_i\) given covariates \(x_i\) is

\[\begin{equation*} P(Y_i=y_i\vert x_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \quad y_i=0,1,2,\dots. \end{equation*}\]

The conditional mean is \(E(y_i\vert x_i)=\mu_i=e^{x_i^\prime\beta}\); the variance is assumed to be equal to the mean.

A problem that often arises when analysing empirical data is that this equidispersion assumption is violated. Empirical data are very often overdispersed, meaning that the variance is larger than the mean. Analysing overdispersed data with classical Poisson regression can lead to an underestimation of the variation in the data and an overestimation of statistical significance (cf. Zeileis et al., 2008). Models that allow for overdispersion are the quasi-Poisson and the negative binomial model.

The quasi-Poisson model is identical to the Poisson model, except that the dispersion parameter is estimated from the data. Another way to address overdispersion is to fit a negative binomial (NB) model, a gamma-Poisson mixture model. There are a number of different parametrizations of the NB model (for details, see Hilbe, 2011). Note that the Poisson model is nested within the NB model and that the NB distribution converges to a Poisson distribution as the overdispersion parameter approaches zero.
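In R, the three models can be fitted as follows (a sketch using the crim4w data introduced above; the formula is an arbitrary illustration, and glm.nb() is provided by the MASS package):

library(MASS)
fit.pois  <- glm(ACRIM ~ FEMALE + GY + HA, family = poisson, data = crim4w)       # Poisson
fit.qpois <- glm(ACRIM ~ FEMALE + GY + HA, family = quasipoisson, data = crim4w)  # quasi-Poisson
fit.nb    <- glm.nb(ACRIM ~ FEMALE + GY + HA, data = crim4w)                      # negative binomial

Note that incomplete cases are dropped listwise in these fits, which is precisely the loss of information that imputation is meant to avoid.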

An excess of zero counts can be addressed by fitting a zero-inflated Poisson or negative binomial model (Lambert, 1992) or by fitting a hurdle model (Mullahy, 1986); countimp supports both options. Both model classes are mixture models and contain a second model component that addresses the excess zeros in the empirical data: Hurdle models combine a zero-truncated count component (a zero-truncated Poisson or NB model) with a hurdle component (usually a binomial GLM) that models whether a zero or a positive count is observed. Zero-inflation models combine a point mass at zero with the count component (a Poisson or NB model); the result of a Bernoulli trial determines which of the two processes generates an observation.

Different sets of covariates can be specified for each model component, since in practice the zero process and the count process are often determined by different sets of predictors.
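With the pscl package, for example, the two components receive separate formula parts separated by a vertical bar (the data and covariate choices below are arbitrary illustrations):

library(pscl)
fit.zinb <- zeroinfl(ACRIM ~ FEMALE + GY | HA, dist = "negbin", data = crim4w)  # count model | zero model
fit.hnb  <- hurdle(ACRIM ~ FEMALE + GY | HA, dist = "negbin", data = crim4w)    # count model | hurdle model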

The main difference between the two model classes is that zero-inflation models allow two possible sources of zeros: a zero may stem from either the count component or the point mass at zero. In hurdle models, by contrast, all zeros originate from the hurdle component, because the count component is zero-truncated.