Chapter 4 countimp

4.1 A cautionary note

Package countimp creates imputations based on parametric count models. Plausibility of the obtained imputations depends on the correct specification of the imputation procedure, i.e. the count data regression model that is used to predict missing information must fit the empirical data well.

Research by Kleinke & Reinecke (2013) suggests that misspecifications of the (count) imputation model can bias statistical inferences quite noticeably. We therefore advise researchers to spend sufficient time on modelling. We refer readers to the excellent book by Hilbe (2011) for guidance on determining which count model fits the data best.

We would also like to draw attention to the R packages Qtools and ImputeRobust, which allow users to create imputations based on nonparametric quantile regression and semi-parametric generalized additive models. These may be suitable alternatives in situations where the assumptions of parametric count models are violated.

4.2 Count data models

An overview of the parametric imputation models available in package countimp is given in Table 4.1. More details on the theoretical background of count regression models are given in Zeileis et al. (2008) and in Hilbe (2011).

Table 4.1: An overview of imputation methods and models in package countimp
Function name Model / family
mice.impute.poisson() Poisson
mice.impute.quasipoisson() Quasi-Poisson
mice.impute.nb() negative binomial
mice.impute.zip() zero-inflated Poisson
mice.impute.zinb() zero-inflated negative binomial
mice.impute.hp() hurdle Poisson
mice.impute.hnb() hurdle negative binomial
mice.impute.2l.poisson() two-level Poisson
mice.impute.2l.nb() two-level negative binomial
mice.impute.2l.zip() two-level zero-inflated Poisson
mice.impute.2l.zinb() two-level zero-inflated NB
mice.impute.2l.hp() two-level hurdle Poisson
mice.impute.2l.hnb() two-level hurdle negative binomial

4.3 Installation

The latest version can be installed from GitHub using function install_git from package devtools:

devtools::install_git(url = "https://github.com/kkleinke/countimp", 
                      branch = "master")

4.4 Imputation of incomplete Poisson distributed variables

The function to impute an incomplete Poisson distributed variable is called mice.impute.poisson(). Appending .boot to the function name calls the bootstrap regression variant (see help("mice.impute.poisson.boot")). The functions have the following arguments:

library(countimp)
args("mice.impute.poisson")
## function (y, ry, x, wy = NULL, EV = TRUE, ...) 
## NULL

Arguments y, ry, and x need not be specified; they are obtained automatically from mice. For information on how to use mice’s wy argument, and for further details, see help("mice").

Numeric vector y is the incomplete Poisson distributed variable that shall be imputed. ry is the response indicator of y: TRUE indicates an observed value and FALSE a missing value in y. wy is a logical vector of length(y); TRUE entries indicate locations in y for which imputations shall be generated (default is !ry). x is a matrix with length(y) rows containing the complete covariates – the variables in the imputation model, which have been specified via mice’s predictorMatrix argument. The variables in x are used to predict missing information in y. Argument EV defines how the function handles extreme imputations: when set to TRUE, the function automatically detects extreme values among the imputations and replaces them by values obtained by predictive mean matching.

We use the standard glm.fit function from package stats to fit the Poisson model to the observed part of the data, i.e. x[ry,], y[ry]. Imputations are drawn from a Poisson distribution with mean \(\lambda\), which is the value predicted for the respective incomplete case.
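The idea can be sketched as follows. This is a simplified illustration of Bayesian Poisson imputation, not the package's actual implementation: the function name is hypothetical, and glm() replaces glm.fit() for brevity.

```r
library(MASS)  # for mvrnorm

## Simplified sketch of Bayesian Poisson imputation (illustrative only)
impute.poisson.sketch <- function(y, ry, x) {
  dat <- data.frame(y = y, x)
  ## fit the Poisson model to the observed part of the data
  fit <- glm(y ~ ., data = dat, subset = ry, family = poisson(link = "log"))
  ## draw coefficients from their approximate posterior N(beta.hat, V)
  beta.star <- mvrnorm(1, coef(fit), vcov(fit))
  ## predicted means lambda for the cases to be imputed
  X.mis <- model.matrix(~ ., data.frame(x))[!ry, , drop = FALSE]
  lambda <- exp(X.mis %*% beta.star)
  ## draw imputations from a Poisson distribution with those means
  rpois(sum(!ry), lambda)
}

set.seed(1)
x  <- cbind(x1 = rnorm(500), x2 = rnorm(500))
y  <- rpois(500, exp(1 + .5 * x[, 1]))
ry <- runif(500) > .2               # response indicator
imp <- impute.poisson.sketch(y, ry, x)
```

Drawing the coefficients from their approximate posterior (rather than using the point estimates) propagates parameter uncertainty into the imputations, which is what distinguishes a "proper" imputation method in Rubin's sense.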

4.4.1 Example

We generate an example data set with a Poisson distributed dependent variable \(y\) and three continuous predictors \(x_1\)–\(x_3\), and introduce MAR missingness in the dependent variable.

set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
lam <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- rpois( N, lam )
POIS <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
generate.md <- function( data, pos = 1, Z = 2, pmis = .5, strength = c( .5, .5 ) ) 
{
 total <- round( pmis * nrow(data) )           ## number of values to delete
 sm <- which( data[,Z] < mean( data[,Z] ) )    ## cases below the mean of column Z
 gr <- which( data[,Z] > mean( data[,Z] ) )    ## cases above the mean of column Z
 sel.sm <- sample( sm, round( strength[1] * total ) )  ## sampled below-mean cases
 sel.gr <- sample( gr, round( strength[2] * total ) )  ## sampled above-mean cases
 sel <- c( sel.sm, sel.gr )
 data[sel,pos] <- NA                           ## delete selected values in column pos
 return(data)
}
MPOIS <- generate.md( POIS, pmis = .2, strength = c( .2, .8) )

This procedure generates 20% missing data in y. Missingness depends on variable x1. Cases whose x1-value is above the mean of x1 are more likely to have a missing value in y. The resulting incomplete data set is called MPOIS.

Imputations are then generated using wrapper function countimp(), which passes the arguments on to mice.

To select function mice.impute.poisson() to impute missing data in \(y\), the respective slot in method has to be set to "poisson". See help("mice") for details.

colnames(MPOIS)
## [1] "y"  "x1" "x2" "x3"

MPOIS has four variables. Thus, method has to be a vector of length 4. y is the first variable in MPOIS. We therefore need to set the first entry in method to "poisson" (or "poisson.boot" for the bootstrap variant). The other variables are completely observed and get method "".

imp <- countimp( MPOIS, method = c( "poisson" ,"" ,"" ,"" ), print=FALSE)
impb <- countimp( MPOIS, method = c( "poisson.boot" ,"" ,"" ,"" ), print=FALSE)

Note that if predictorMatrix is not specified, all other variables are used to predict missing information in y.

The resulting object imp is of class mids (a multiply imputed data set). Standard mice functions like with() and pool() can be applied to that object to automate repeated data analysis and pooling of results.

repeated.analysis <- with(imp, glm(y~x1+x2+x3,family = poisson))
summary(pool(repeated.analysis))
##                    est          se         t         df Pr(>|t|)
## (Intercept)  0.9779894 0.009909391  98.69320  700.87521        0
## x1           0.7507701 0.007478281 100.39341 1030.71997        0
## x2          -0.2342093 0.008082908 -28.97588   91.46838        0
## x3           0.4846562 0.008227717  58.90531   55.51060        0
##                  lo 95      hi 95 nmis        fmi     lambda
## (Intercept)  0.9585338  0.9974451   NA 0.07225868 0.06961507
## x1           0.7360957  0.7654445    0 0.05690214 0.05507394
## x2          -0.2502639 -0.2181547    0 0.22348656 0.20669134
## x3           0.4681709  0.5011415    0 0.29147133 0.26639537

For details about the respective slots of the output, see van Buuren & Groothuis-Oudshoorn (2011).

4.5 Imputation of overdispersed count data

countimp has two methods to impute overdispersed count data:

  1. imputation based on a quasi-Poisson model, and
  2. imputation based on a negative binomial model.

Both methods can be used when the equidispersion assumption of the standard Poisson model is violated, i.e. when the variance of the count variable is larger than its mean.

4.5.1 Imputation based on a quasi-Poisson model

Method "quasipoisson" imputes univariate missing data based on a Bayesian quasi-Poisson GLM using the standard R function glm.fit (i.e. using family = quasipoisson(link = log)). Imputations are simulated from a negative binomial distribution to ensure an adequate dispersion of imputed values: The quasi-Poisson model estimates Pearson dispersion \(\delta\) that gives the amount of overdispersion in the data. Pearson dispersion \(\delta\) is stored in slot $dispersion of the output of the summary.glm function. We supply this parameter to the rnbinom() function from package stats to obtain the imputations. For details about the parametrization of the NB distribution used by rnbinom, see help("rnbinom").

4.5.2 Example

We simulate an overdispersed count variable y and generate about 20% missing data in this variable. Imputation, repeated data analysis, and pooling of results work analogously to Section 4.4.1.

## simulate overdispersed count data
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
mu <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- MASS::rnegbin( N, theta = 2, mu )
NB <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
total <- round( .2 * N )  ##number of missing data in y
sm <- which( NB[,2] < mean( NB[,2] ) )  ##subset: cases with x1<mean(x1)
gr <- which( NB[,2] > mean( NB[,2] ) )  ##subset: cases with x1>mean(x1)
sel.sm <- sample( sm, round( .2 * total ) ) ##select cases to set as missing
sel.gr <- sample( gr, round( .8 * total ) ) ##select cases to set as missing
sel <- c( sel.sm,sel.gr )
MNB <- NB
MNB[sel,1] <- NA    ##delete selected data

## impute missing data
imp <- countimp( MNB, method = c( "quasipoisson", "", "", "" ), print = FALSE)

## repeated data analysis and pooling of results
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family=quasipoisson))
summary(pool(repeated.analysis))
##                    est         se         t        df Pr(>|t|)      lo 95
## (Intercept)  1.0075246 0.01613825  62.43084 4358.0686        0  0.9758854
## x1           0.7450712 0.01338057  55.68306  163.5364        0  0.7186503
## x2          -0.2469628 0.01274407 -19.37865  493.9668        0 -0.2720021
## x3           0.4555993 0.01448504  31.45310   38.6698        0  0.4262926
##                  hi 95 nmis        fmi     lambda
## (Intercept)  1.0391638   NA 0.01086766 0.01041384
## x1           0.7714922    0 0.16350855 0.15334069
## x2          -0.2219235    0 0.08866686 0.08498446
## x3           0.4849061    0 0.35243277 0.31978490

4.6 Imputation based on a negative binomial model

Method "nb" fits a negative binomial model using function glm.nb() from package MASS and simulates the imputations from a negative binomial distribution, using function rnegbin() from package MASS.

The imputation process works analogously to Section 4.4.1.
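A minimal sketch of this approach on simulated data (illustrative names; not the package code):

```r
library(MASS)   # glm.nb, rnegbin

set.seed(7)
n  <- 500
x1 <- rnorm(n)
y  <- rnegbin(n, mu = exp(1 + .5 * x1), theta = 2)   # overdispersed counts
ry <- runif(n) > .25                                 # response indicator
dat <- data.frame(y, x1)

## fit the NB model to the observed part, then simulate imputations
## from the fitted negative binomial distribution
fit <- glm.nb(y ~ x1, data = dat, subset = ry)
mu  <- predict(fit, newdata = dat[!ry, , drop = FALSE], type = "response")
imp <- rnegbin(sum(!ry), mu = mu, theta = fit$theta)
```

Unlike the quasi-Poisson approach, the overdispersion parameter \(\theta\) is estimated directly by glm.nb() and can be passed straight to rnegbin().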

4.7 Imputation of count data with an excess number of zero counts

Methods "zip" and "zinb" impute missing data based on a zero-inflated Poisson or NB model respectively (for details, see Kleinke & Reinecke, 2013). Methods "hp" and "hnb" use a hurdle Poisson or NB model.

Users can specify different sets of predictors for the zero and count parts of the models. Allowed entries in the predictorMatrix are listed in Table 4.2.

Table 4.2: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in either model
1 variable is included in both the zero and the count model
2 count model predictor
3 zero model predictor

Note that the current mice version does not allow entries in the predictorMatrix other than 0 and 1, unless the model is a multilevel model. Two-level imputation models in mice have the prefix 2l. Kleinke & Reinecke (2013) therefore used this workaround and labeled the functions 2l.zip and 2l.zinb – these were included in countimp version 1 to allow the specification of different sets of predictors for the zero and count parts of the model via the predictorMatrix. This, however, led to some confusion among empirical researchers, since these functions did not fit two-level models.

Caution: In version 2 of package countimp, methods 2l.zip and 2l.zinb in fact fit two-level models – see Section 4.10.

In version 2 of package countimp, wrapper function countimp() takes care of this problem. It is therefore important that users call countimp() rather than mice() to create the imputations when the imputation model is either a zero-inflation model or a hurdle model. countimp() takes the same arguments as mice() and passes them on to mice; the imputations are then created by mice.
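For illustration, a specification following Table 4.2 might look like this. The incomplete data set MZIP and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical incomplete data set MZIP with variables y, x1, x2, x3:
## let x1 and x2 predict the count part and x3 the zero part of the model.
vars <- c("y", "x1", "x2", "x3")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", c("x1", "x2")] <- 2   # count model predictors (code 2)
pred["y", "x3"]          <- 3   # zero model predictor (code 3)
## imp <- countimp(MZIP, method = c("zip", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```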

4.7.1 Monte Carlo Simulation

Kleinke & Reinecke (2013) have tested the zero-inflation functions (methods "2l.zip" and "2l.zinb") in various scenarios and found that they performed adequately well when the empirical data were in fact zero-inflated Poisson or negative binomial. Here, we use a highly similar set-up to evaluate the new hurdle model variants of these functions, which are available in version 2 of the countimp package.

Note that unlike zero-inflation models, hurdle models allow only one source of zeros. The zero model determines whether case \(i\) has a zero or non-zero observation, and the count model determines which non-zero observation case \(i\) has. Hurdle models assume that the residuals of the hurdle component (here the zero model) and the outcome model (here the count model) are uncorrelated. This assumption is plausible if quite distinct groups form the zero versus the non-zero category.

An example where a hurdle model might be an appropriate analysis model is the number of cigarettes smoked per day. Non-smokers will quite certainly have a zero count, while smokers will vary in their number of cigarettes smoked per day. Furthermore, one set of variables might predict whether a person is a smoker or a non-smoker, while another set predicts how many cigarettes a smoker consumes per day.

To test the new hurdle model imputation functions, we created 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) respectively that contained four variables: \(y\) the dependent count variable that has an excess number of zeros, and continuous predictors \(x1\), \(x2\), and \(x3\).

\(x_1\)–\(x_3\) were drawn from normal \(\mathcal{N}(0,1)\) populations. \(\mu_{0i}\) was the individual probability of belonging to the count versus the zero group and was determined by a logit model

\[\begin{equation} \mu_{0i}=\text{invlogit}(2\cdot x_3) \end{equation}\]

The respective observation in \(y\) was drawn from a binomial distribution with size 1 and probability \(\mu_{0i}\). Finally, a “1” in \(y\) was replaced by a draw from a zero-truncated Poisson distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) in the Poisson conditions, or from a zero-truncated negative binomial distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) and overdispersion parameter \(\theta=1\) in the negative binomial conditions. MAR missingness was then introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression variants of our hurdle model imputation functions. Function hurdle from package pscl was used to fit the corresponding hurdle model to the completed data sets. Results of the Poisson conditions are displayed in Tables 4.3 and 4.4. \(\beta\) are the coefficients in the count part of the model, \(\gamma\) the coefficients in the zero model. \(Q\) are the simulated true population quantities, \(\widehat{Q}\) the Monte Carlo estimates, and \(SD_{\widehat{Q}}\) is the standard deviation of the estimates across the 1000 replications. We also report %Bias and 95% coverage rates. As can be seen in Tables 4.3 and 4.4, estimates are always very close to the true population quantities, bias is very small, and coverage is adequate for all parameters regardless of sample size.
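Drawing from a zero-truncated Poisson distribution, as in the data-generating step above, can be sketched via inverse-CDF sampling. This is one common approach; the function name is illustrative:

```r
## Zero-truncated Poisson draws: sample the uniform variate above the
## probability mass at zero, then invert the ordinary Poisson CDF.
rztpois <- function(n, lambda) {
  u <- runif(n, min = dpois(0, lambda), max = 1)
  qpois(u, lambda)
}

set.seed(3)
draws <- rztpois(1000, lambda = exp(1))  # every draw is >= 1
```

The zero-truncated NB draws used in the NB conditions can be obtained analogously by replacing dpois/qpois with dnbinom/qnbinom.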

Table 4.3: Performance of MI based on a Bayesian regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.996 0.295 0.294 -0.005 2.028
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.050 0.145 0.228
500 Bias 0.004 0.005 0.006 0.005 -0.028
500 Coverage Rate 96.700 95.300 95.800 95.300 94.900
1000 \(\widehat{Q}\) 0.997 0.297 0.296 -0.003 2.014
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.036 0.101 0.163
1000 Bias 0.003 0.003 0.004 0.003 -0.014
1000 Coverage Rate 94.600 95.200 95.200 95.800 95.400
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.048
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 94.400 94.000 96.100 95.200 94.900
Table 4.4: Performance of MI based on a bootstrap regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.994 0.293 0.293 -0.007 2.037
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.049 0.147 0.234
500 Bias 0.006 0.007 0.007 0.007 -0.037
500 Coverage Rate 95.400 95.300 96.200 95.100 95.000
1000 \(\widehat{Q}\) 0.996 0.295 0.295 -0.005 2.017
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.035 0.101 0.164
1000 Bias 0.004 0.005 0.005 0.005 -0.017
1000 Coverage Rate 94.600 95.200 95.100 95.500 93.800
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.049
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 93.900 94.100 94.400 94.200 95.300
Table 4.5: Performance of MI based on a Bayesian regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.990 0.284 0.290 0.002 2.025
500 \(SD_{\widehat{Q}}\) 0.132 0.095 0.094 0.142 0.230
500 Bias 0.010 0.016 0.010 -0.002 -0.025
500 Coverage Rate 94.900 93.500 94.400 95.200 95.100
1000 \(\widehat{Q}\) 0.992 0.289 0.290 -0.004 2.019
1000 \(SD_{\widehat{Q}}\) 0.098 0.065 0.064 0.100 0.162
1000 Bias 0.008 0.011 0.010 0.004 -0.019
1000 Coverage Rate 94.400 94.400 93.800 95.900 96.400
10000 \(\widehat{Q}\) 1.000 0.298 0.299 0.001 2.003
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.021 0.032 0.050
10000 Bias 0.000 0.002 0.001 -0.001 -0.003
10000 Coverage Rate 94.200 95.000 95.500 93.900 94.700
Table 4.6: Performance of MI based on a bootstrap regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.992 0.280 0.288 0.002 2.039
500 \(SD_{\widehat{Q}}\) 0.134 0.094 0.093 0.145 0.230
500 Bias 0.008 0.020 0.012 -0.002 -0.039
500 Coverage Rate 94.600 92.900 92.800 94.800 94.600
1000 \(\widehat{Q}\) 0.991 0.285 0.289 -0.003 2.024
1000 \(SD_{\widehat{Q}}\) 0.096 0.065 0.066 0.101 0.164
1000 Bias 0.009 0.015 0.011 0.003 -0.024
1000 Coverage Rate 95.400 94.300 94.200 95.800 94.400
10000 \(\widehat{Q}\) 1.000 0.297 0.298 0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.020 0.032 0.050
10000 Bias 0.000 0.003 0.002 -0.001 -0.004
10000 Coverage Rate 93.600 93.800 95.300 95.300 94.600

Results of the NB conditions are displayed in Tables 4.5 and 4.6. Again, parameter estimates are typically accurate and coverage is fine for all parameters. In the small sample size condition with \(N=500\), the overdispersion parameter \(\theta\) was slightly overestimated (1.149 for the Bayesian regression method and 1.187 for the bootstrap regression method respectively).

With \(N=1000\), overdispersion parameter \(\theta\) was 1.082 for the Bayesian regression method, and 1.102 for the bootstrap regression method. When \(N=10000\), overdispersion parameter \(\theta\) was 1.010 for the Bayesian regression method, and 1.013 for the bootstrap regression method. In comparison, the mean estimate for \(\theta\) of the complete data sets – i.e. before any data in \(y\) were deleted – was 1.056.

All in all, results suggest that the imputation methods work sufficiently well.

4.8 Imputation of two-level Poisson distributed data

mice.impute.2l.poisson() imputes missing data based on a generalized linear mixed effects Poisson model. Note that the function requires the data to be in long format, i.e. the repeated measurements are stacked upon one another. Additional time and id variables indicate the measurement time point (when the data are panel data) and cluster membership. The function is described in detail in Kleinke & Reinecke (2015a). To specify the imputation model, we use the same coding scheme as other two-level imputation functions in mice. Allowed entries in the predictorMatrix are given in Table 4.7:

Table 4.7: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable

The intercept is always treated as a fixed and random factor. If this is not desired, the .noint variant of the function can be used, which treats the intercept as a fixed but not as a random effect (see help("mice.impute.2l.poisson")).
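For illustration, a specification following Table 4.7 might look like this. The long-format data set and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical long-format data: y (incomplete count), x (level-1
## predictor), z (level-2 predictor), id (cluster identifier)
vars <- c("y", "x", "z", "id")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", "x"]  <-  2   # fixed and random effect (code 2)
pred["y", "z"]  <-  1   # fixed effect only (code 1)
pred["y", "id"] <- -2   # class variable (code -2)
## imp <- countimp(dat, method = c("2l.poisson", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```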

4.9 Imputation of overdispersed two-level count data

Method 2l.nb imputes incomplete data based on a two-level NB model (for details, see Kleinke & Reinecke, 2015b). In version 2 of package countimp, we have replaced package glmmADMB by glmmTMB to estimate the two-level NB model. In our practical experience, estimation by glmmTMB is faster and more stable. Note that the function requires data in long format and uses the same coding scheme as other two-level imputation functions in mice (cf. van Buuren & Groothuis-Oudshoorn, 2011). The specification of the imputation method and model works analogously to other multilevel imputation functions in mice, see for example help("mice.impute.2l.pan"). Allowed entries in the predictorMatrix are listed in Table 4.8.

Table 4.8: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable

4.9.1 Monte Carlo Simulation

We tested the refurbished version of mice.impute.2l.nb by means of a Monte Carlo simulation (with 200 replications). The design was very similar to the one used by Kleinke & Reinecke (2015b).

We simulated data of \(g=50\) groups of varying group sizes \(n_j\), \(j\in\{1,2,\dots,g\}\).

\(n_j\) were drawn randomly from the integers between 100 and 400. The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), one continuous group level predictor \(z\), and the overdispersed dependent count variable \(y\). The intercept and the slope of \(x\) were treated as random factors (i.e. they were allowed to vary across the 50 groups). We set the following population parameters: \(\beta_0=1\), the fixed intercept term, \(\beta_1=.75\), the coefficient of the individual level predictor \(x\), \(\beta_2=.5\), the coefficient of the group level predictor \(z\), and \(\theta=2\), the overdispersion parameter. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_j\), \(z_j\), and the individual level residuals \(e_j\) were simulated from normal \(\mathcal{N}(0,1)\) populations. Random effects were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows.

We first obtained parameters

\[\begin{equation*} {\beta_{j}}={\Gamma}{Z_{j}}+{u_{j}} \end{equation*}\]

where

\[\begin{equation*} {\beta_{j}}=\begin{bmatrix} {\beta_{0j}} \\ {\beta_{1j}} \end{bmatrix}, {\Gamma} = \begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \\ \end{bmatrix} = \begin{bmatrix} 1 & .5 \\ 0.75 & 0 \\ \end{bmatrix}, {Z_j}=\begin{bmatrix} 1 \\ z_{j} \end{bmatrix}, \text{and } {u_{j}}=\begin{bmatrix} {u_{0j}} \\ {u_{1j}} \end{bmatrix}. \end{equation*}\]

For each individual \(i\), the dependent count variable \(y_{ij}\) was then drawn from \(y_{ij}\sim\text{NB}(\mu_{ij}=\exp(X_{ij}\beta_j),\theta=2)\), using the rnegbin() function from R package MASS, where \(X_{ij}=\begin{bmatrix}1&x_{ij}\end{bmatrix}\) .

In the next step, we introduced missing data in \(y\). Predictors \(x\), and \(z\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+z_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.2% (\(SD=3.0\%\)).
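The missingness rule above can be sketched as follows. plogis() is R's inverse logit; the outcome vector here is a simple stand-in rather than the simulated two-level data:

```r
set.seed(11)
n <- 10000
z <- rnorm(n)                  # fully observed covariate
y <- rpois(n, lambda = 3)      # stand-in for the simulated outcome
p <- plogis(-1 + z)            # missingness probability invlogit(-1 + z)
v <- runif(n)                  # uniform random draws
y[v < p] <- NA                 # set y to NA where v < p
mean(is.na(y))                 # close to .30 on average
```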

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and bootstrap regression versions of the function. The imputation model included variables \(x\) and \(z\). The intercept and the slope of \(x\) were treated as random factors. We used the mice default settings and monitored convergence as outlined in van Buuren & Groothuis-Oudshoorn (2011). The data sets were then analyzed by the same NB2 model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987). Results are displayed in Tables 4.9 and 4.10. The tables display (a) the average parameter estimate across the 200 replications, (b) bias, which is defined as the difference between the true parameter and the average parameter estimate, (c) %Bias, which is bias divided by the true parameter, multiplied by 100%, and (d) 95% coverage (i.e. the percentage of confidence intervals that include the true parameter).

Table 4.9: Performance of MI based on a Bayesian regression two-level NB2 model
Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_{00}\) \(\sigma_{11}\) \(\sigma_{01}\) \(\theta\)
Q 1.000 0.750 0.500 0.500 0.300 0.000 2.000
\(\widehat{Q}\) 0.996 0.748 0.503 0.471 0.294 -0.088 1.961
Bias 0.004 0.002 -0.003 0.029 0.006 0.088 0.039
%Bias 0.400 0.267 -0.600 5.800 2.000 Inf 1.950
Coverage Rate 93.500 95.500 94.500 NA NA NA NA
Table 4.10: Performance of MI based on a bootstrap regression two-level NB2 model
Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_{00}\) \(\sigma_{11}\) \(\sigma_{01}\) \(\theta\)
Q 1.000 0.750 0.500 0.500 0.300 0.000 2.000
\(\widehat{Q}\) 0.996 0.748 0.504 0.472 0.295 -0.102 1.977
Bias 0.004 0.002 -0.004 0.028 0.005 0.102 0.023
%Bias 0.400 0.267 -0.800 5.600 1.667 Inf 1.150
Coverage Rate 94.000 95.500 93.500 NA NA NA NA

As can be seen, the average parameter estimates are close to the defined population quantities, %Bias is below 10%, and coverage is above 90% for all parameters, which means that the imputation function works nicely when, as in this case, model assumptions are met.

4.10 Imputation of zero-inflated two-level count data

Methods "2l.hp" and "2l.hnb" impute zero-inflated two-level count data based on a two-level hurdle model. Analogously, methods "2l.zip" and "2l.zinb" create imputations based on a two-level zero-inflated Poisson or NB model respectively. The .noint variants of the functions treat the intercept only as a fixed, but not as a random effect. It may be specified, if the intercept is excluded from the random part of the zero model (.noint.zero), the random part of the count model (.noint.count), or from the random part of both models (.noint.both). For details about all model variants, see help("mice.impute.2l.hnb"

Different sets of predictors can be specified for the zero and count parts of the model. Allowed entries in the predictorMatrix are listed in Table 4.11:

Table 4.11: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
3 variable is included as a fixed effect (count model only)
4 variable is included as a fixed and random effect (count model only)
5 variable is included as a fixed effect (zero model only)
6 variable is included as a fixed and random effect (zero model only)
-2 class variable
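For illustration, a specification following Table 4.11 might look like this. The long-format data set and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical long-format data: y (incomplete count), x, w, id.
## x enters both model parts as a fixed and random effect (code 2);
## w enters only the zero model as a fixed effect (code 5).
vars <- c("y", "x", "w", "id")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", "x"]  <-  2
pred["y", "w"]  <-  5
pred["y", "id"] <- -2   # class variable
## imp <- countimp(dat, method = c("2l.hnb", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```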

4.10.1 Monte Carlo Simulation

We evaluated these new or updated imputation procedures for zero-inflated two-level data by means of Monte Carlo simulation (with 200 replications). The design was similar to the one used by Kleinke & Reinecke (2015b): We simulated data of \(g=50\) groups of sample sizes \(n_j=100\), \(j\in\{1,\dots,g\}\). The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), and the overdispersed dependent count variable \(y\). \(x\) influenced both the zero and the count process. The intercept and the slope of \(x\) were treated as random in both the zero and the count part of the model. We set the following population parameters: \(\beta_0=1\) and \(\beta_1=.75\), the fixed intercept and slope terms of the count part of the model, and \(\gamma_0=0\) and \(\gamma_1=.5\), the fixed intercept and slope terms in the zero part of the model. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_{ij}\) were simulated from a normal \(\mathcal{N}(0,1)\) population. Random effects \(u\) were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows. We first obtained parameters for the zero part and the count part of the model. For each group \(j\), \(j\in\{1,\dots,g\}\), we simulated parameters:

\[\begin{align*} \gamma_{0j}&=0+u_{0jz}\\ \gamma_{1j}&=.5+u_{1jz}\\ \beta_{0j}&=1+u_{0jc}\\ \beta_{1j}&=.75+u_{1jc} \end{align*}\]

The probability of belonging to the zero versus non-zero category was then determined by

\[\begin{equation*} \mu_{ij}=\text{invlogit}(\gamma_{0j}+\gamma_{1j}x_{1i}) \end{equation*}\]

We then drew random uniform \(z\sim\mathcal{U}(0,1)\) numbers and if \(z_{ij}<\mu_{ij}\) then the dependent variable \(y\) was set to 0, and 1 otherwise. In the next step, a nonzero observation in \(y\) was replaced by a random draw from a zero-truncated Poisson distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{1ij})\) in the Poisson condition, and by a random draw from a zero-truncated NB distribution with mean \(\exp(\beta_{0j}+\beta_{1j} x_{1ij})\) and overdispersion parameter \(\theta=2\) in the NB condition.

Then, we introduced missing data in \(y\). Predictor \(x\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+x_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.4%.

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and bootstrap regression versions of the respective function (hurdle Poisson imputation in the Poisson condition, and hurdle NB imputation in the negative binomial condition). \(x\) was included as predictor for the zero and count part of the model. The intercept and the slope of \(x\) were treated as random factors (again in both parts of the model). We used the mice default settings and monitored convergence (for details, see Chapter 5). The \(m\) completed data sets were then analyzed by the same model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987).

Results of hurdle Poisson imputation are displayed in Tables 4.12 and 4.13, results of hurdle NB imputation are displayed in Tables 4.14 and 4.15. The tables display the average parameter estimate across the 200 replications, the standard deviation of the estimates across the replications, bias, percent bias and 95% coverage.

Table 4.12: Performance of MI based on a Bayesian regression two-level hurdle Poisson model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 1.001 -0.001 -0.100 94.924
\(\beta_1\) 0.75 0.746 0.004 0.533 93.909
\(\gamma_0\) 0.00 -0.006 0.006 Inf 92.893
\(\gamma_1\) 0.50 0.492 0.008 1.600 96.447
\(\sigma_{00c}\) 0.50 0.491 0.009 1.800 NA
\(\sigma_{11c}\) 0.30 0.283 0.017 5.667 NA
\(\sigma_{01c}\) 0.00 -0.008 0.008 Inf NA
\(\sigma_{00z}\) 0.50 0.477 0.023 4.600 NA
\(\sigma_{11z}\) 0.30 0.271 0.029 9.667 NA
\(\sigma_{01z}\) 0.00 -0.053 0.053 Inf NA
Table 4.13: Performance of MI based on a bootstrap regression two-level hurdle Poisson model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 1.011 -0.011 -1.100 94.416
\(\beta_1\) 0.75 0.748 0.002 0.267 96.447
\(\gamma_0\) 0.00 -0.006 0.006 Inf 91.878
\(\gamma_1\) 0.50 0.495 0.005 1.000 91.878
\(\sigma_{00c}\) 0.50 0.446 0.054 10.800 NA
\(\sigma_{11c}\) 0.30 0.261 0.039 13.000 NA
\(\sigma_{01c}\) 0.00 -0.127 0.127 Inf NA
\(\sigma_{00z}\) 0.50 0.439 0.061 12.200 NA
\(\sigma_{11z}\) 0.30 0.269 0.031 10.333 NA
\(\sigma_{01z}\) 0.00 -0.108 0.108 Inf NA
Table 4.14: Performance of MI based on a Bayesian regression two-level hurdle NB model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 0.993 0.007 0.700 93.939
\(\beta_1\) 0.75 0.735 0.015 2.000 92.424
\(\gamma_0\) 0.00 -0.003 0.003 Inf 93.939
\(\gamma_1\) 0.50 0.497 0.003 0.600 96.970
\(\sigma_{00c}\) 0.50 0.480 0.020 4.000 NA
\(\sigma_{11c}\) 0.30 0.276 0.024 8.000 NA
\(\sigma_{01c}\) 0.00 -0.059 0.059 Inf NA
\(\sigma_{00z}\) 0.50 0.485 0.015 3.000 NA
\(\sigma_{11z}\) 0.30 0.274 0.026 8.667 NA
\(\sigma_{01z}\) 0.00 -0.054 0.054 Inf NA
\(\theta\) 2.00 2.098 -0.098 -4.900 NA
Table 4.15: Performance of MI based on a bootstrap regression two-level hurdle NB model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 0.996 0.004 0.400 92.929
\(\beta_1\) 0.75 0.743 0.007 0.933 93.939
\(\gamma_0\) 0.00 -0.001 0.001 Inf 93.434
\(\gamma_1\) 0.50 0.500 0.000 0.000 95.455
\(\sigma_{00c}\) 0.50 0.440 0.060 12.000 NA
\(\sigma_{11c}\) 0.30 0.262 0.038 12.667 NA
\(\sigma_{01c}\) 0.00 -0.151 0.151 Inf NA
\(\sigma_{00z}\) 0.50 0.444 0.056 11.200 NA
\(\sigma_{11z}\) 0.30 0.268 0.032 10.667 NA
\(\sigma_{01z}\) 0.00 -0.127 0.127 Inf NA
\(\theta\) 2.00 1.991 0.009 0.450 NA

Estimates are sufficiently accurate. It appears that the bootstrap variants performed slightly worse than the Bayesian variants, especially regarding the random part of the model. This might be caused by the block bootstrap approach implemented here (we re-sample clusters rather than individual cases), which requires a substantial number of clusters. Future research needs to establish under which scenarios it is better to use the Bayesian variants of the functions and when to use the bootstrap variants. Little is yet known about level-1 and level-2 sample size requirements.