Chapter 4 countimp

4.1 A cautionary note

Package countimp creates imputations based on parametric count models. Plausibility of the obtained imputations depends on the correct specification of the imputation procedure, i.e. the count data regression model that is used to predict missing information must fit the empirical data well.

Research by Kleinke & Reinecke (2013) suggests that misspecifications of the (count) imputation model can bias statistical inferences quite noticeably. We therefore advise researchers to spend sufficient time on modelling. We refer readers to the excellent book by Hilbe (2011) for guidance on determining which count model fits the data best.

We would also like to draw attention to the R packages Qtools and ImputeRobust, which allow users to create imputations based on nonparametric quantile regression and semi-parametric generalized additive models. These may be suitable alternatives in situations where the assumptions of parametric count models are violated.

4.2 Count data models

An overview of the parametric imputation models available in package countimp is given in Table 4.1. More details on the theoretical background of count regression models are given in Zeileis et al. (2008) and in Hilbe (2011).

Table 4.1: An overview of imputation methods and models in package countimp
Function name Model / family
mice.impute.poisson() Poisson
mice.impute.quasipoisson() Quasi-Poisson
mice.impute.nb() negative binomial
mice.impute.zip() zero-inflated Poisson
mice.impute.zinb() zero-inflated negative binomial
mice.impute.hp() hurdle Poisson
mice.impute.hnb() hurdle negative binomial
mice.impute.2l.poisson() two-level Poisson
mice.impute.2l.nb() two-level negative binomial
mice.impute.2l.zip() two-level zero-inflated Poisson
mice.impute.2l.zinb() two-level zero-inflated NB
mice.impute.2l.hp() two-level hurdle Poisson
mice.impute.2l.hnb() two-level hurdle negative binomial

4.3 Installation

The latest version can be installed from GitHub using function install_git from package devtools:

devtools::install_git(url = "https://github.com/kkleinke/countimp", 
                      branch = "master")

4.4 Imputation of incomplete Poisson distributed variables

The function to impute an incomplete Poisson distributed variable is called mice.impute.poisson(). Appending .boot to the function name calls the bootstrap regression variant (see help("mice.impute.poisson.boot")). The functions have the following arguments:

library(countimp)
args("mice.impute.poisson")
## function (y, ry, x, wy = NULL, EV = TRUE, ...) 
## NULL

Arguments y, ry, and x need not be specified; they are obtained automatically from mice. For information on how to use mice’s wy argument, and for further details, see help("mice").

Numeric vector y is the incomplete Poisson distributed variable that shall be imputed. ry is the response indicator of y: TRUE indicates an observed value and FALSE a missing value in y. wy is a logical vector of length(y); TRUE entries indicate locations in y for which imputations shall be generated (default is !ry). x is a matrix with length(y) rows containing the complete covariates – the variables in the imputation model, which have been specified via mice’s predictorMatrix argument. The variables in x are used to predict missing information in y. Argument EV defines how the function handles extreme imputations: when set to TRUE, the function automatically detects extreme values among the imputations and replaces them by values obtained by predictive mean matching.

We use the standard glm.fit function from package stats to fit the Poisson model to the observed part of the data, i.e. x[ry,], y[ry]. Imputations are drawn from a Poisson distribution with mean \(\lambda\), which is the value predicted for the respective incomplete case.
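The idea can be sketched as follows. This is a simplified illustration of Bayesian Poisson imputation, not the package's actual implementation: the function name is hypothetical, and glm() replaces glm.fit() for brevity.

```r
library(MASS)  # for mvrnorm

## Simplified sketch of Bayesian Poisson imputation (illustrative only)
impute.poisson.sketch <- function(y, ry, x) {
  dat <- data.frame(y = y, x)
  ## fit the Poisson model to the observed part of the data
  fit <- glm(y ~ ., data = dat, subset = ry, family = poisson(link = "log"))
  ## draw coefficients from their approximate posterior N(beta.hat, V)
  beta.star <- mvrnorm(1, coef(fit), vcov(fit))
  ## predicted means lambda for the cases to be imputed
  X.mis <- model.matrix(~ ., data.frame(x))[!ry, , drop = FALSE]
  lambda <- exp(X.mis %*% beta.star)
  ## draw imputations from a Poisson distribution with those means
  rpois(sum(!ry), lambda)
}

set.seed(1)
x  <- cbind(x1 = rnorm(500), x2 = rnorm(500))
y  <- rpois(500, exp(1 + .5 * x[, 1]))
ry <- runif(500) > .2               # response indicator
imp <- impute.poisson.sketch(y, ry, x)
```

Drawing the coefficients from their approximate posterior (rather than using the point estimates) propagates parameter uncertainty into the imputations, which is what distinguishes a "proper" imputation method in Rubin's sense.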

4.4.1 Example

We generate an example data set with a Poisson distributed dependent variable \(y\) and three continuous predictors \(x_1\)–\(x_3\), and introduce MAR missingness in the dependent variable.

set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
lam <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- rpois( N, lam )
POIS <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
generate.md <- function( data, pos = 1, Z = 2, pmis = .5, strength = c( .5, .5 ) ) 
{
 total <- round( pmis * nrow(data) )           ## number of values to delete
 sm <- which( data[,Z] < mean( data[,Z] ) )    ## cases below the mean of column Z
 gr <- which( data[,Z] > mean( data[,Z] ) )    ## cases above the mean of column Z
 sel.sm <- sample( sm, round( strength[1] * total ) )  ## sampled below-mean cases
 sel.gr <- sample( gr, round( strength[2] * total ) )  ## sampled above-mean cases
 sel <- c( sel.sm, sel.gr )
 data[sel,pos] <- NA                           ## delete selected values in column pos
 return(data)
}
MPOIS <- generate.md( POIS, pmis = .2, strength = c( .2, .8) )

This procedure generates 20% missing data in y. Missingness depends on variable x1. Cases whose x1-value is above the mean of x1 are more likely to have a missing value in y. The resulting incomplete data set is called MPOIS.

Imputations are then generated using wrapper function countimp(), which passes the arguments on to mice.

To select function mice.impute.poisson() to impute missing data in \(y\), the respective slot in method has to be set to "poisson". See help("mice") for details.

colnames(MPOIS)
## [1] "y"  "x1" "x2" "x3"

MPOIS has four variables. Thus, method has to be a vector of length 4. y is the first variable in MPOIS. We therefore need to set the first entry in method to "poisson" (or "poisson.boot" for the bootstrap variant). The other variables are completely observed and get method "".

imp <- countimp( MPOIS, method = c( "poisson" ,"" ,"" ,"" ), print=FALSE)
impb <- countimp( MPOIS, method = c( "poisson.boot" ,"" ,"" ,"" ), print=FALSE)

Note that if predictorMatrix is not specified, all other variables are used to predict missing information in y.

The resulting object imp is of class mids (a multiply imputed data set). Standard mice functions like with() and pool() can be applied to that object to automate repeated data analysis and pooling of results.

repeated.analysis <- with(imp, glm(y~x1+x2+x3,family = poisson))
summary(pool(repeated.analysis))
##                    est          se         t         df Pr(>|t|)
## (Intercept)  0.9779894 0.009909391  98.69320  700.87521        0
## x1           0.7507701 0.007478281 100.39341 1030.71997        0
## x2          -0.2342093 0.008082908 -28.97588   91.46838        0
## x3           0.4846562 0.008227717  58.90531   55.51060        0
##                  lo 95      hi 95 nmis        fmi     lambda
## (Intercept)  0.9585338  0.9974451   NA 0.07225868 0.06961507
## x1           0.7360957  0.7654445    0 0.05690214 0.05507394
## x2          -0.2502639 -0.2181547    0 0.22348656 0.20669134
## x3           0.4681709  0.5011415    0 0.29147133 0.26639537

For details about the respective slots of the output, see van Buuren & Groothuis-Oudshoorn (2011).

4.5 Imputation of overdispersed count data

countimp has two methods to impute overdispersed count data:

  1. imputation based on a quasi-Poisson model, and
  2. imputation based on a negative binomial model.

Both methods can be used when the equidispersion assumption of the standard Poisson model is violated, i.e. when the variance of the count variable is larger than its mean.

4.5.1 Imputation based on a quasi-Poisson model

Method "quasipoisson" imputes univariate missing data based on a Bayesian quasi-Poisson GLM using the standard R function glm.fit (i.e. using family = quasipoisson(link = log)). Imputations are simulated from a negative binomial distribution to ensure an adequate dispersion of imputed values: The quasi-Poisson model estimates Pearson dispersion \(\delta\) that gives the amount of overdispersion in the data. Pearson dispersion \(\delta\) is stored in slot $dispersion of the output of the summary.glm function. We supply this parameter to the rnbinom() function from package stats to obtain the imputations. For details about the parametrization of the NB distribution used by rnbinom, see help("rnbinom").

4.5.2 Example

We simulate an overdispersed count variable y and generate about 20% missing data in this variable. Imputation, repeated data analysis, and pooling of results work analogously to Section 4.4.1.

## simulate overdispersed count data
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
mu <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- MASS::rnegbin( N, theta = 2, mu )
NB <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
total <- round( .2 * N )  ##number of missing data in y
sm <- which( NB[,2] < mean( NB[,2] ) )  ##subset: cases with x1<mean(x1)
gr <- which( NB[,2] > mean( NB[,2] ) )  ##subset: cases with x1>mean(x1)
sel.sm <- sample( sm, round( .2 * total ) ) ##select cases to set as missing
sel.gr <- sample( gr, round( .8 * total ) ) ##select cases to set as missing
sel <- c( sel.sm,sel.gr )
MNB <- NB
MNB[sel,1] <- NA    ##delete selected data

## impute missing data
imp <- countimp( MNB, method = c( "quasipoisson", "", "", "" ), print = FALSE)

## repeated data analysis and pooling of results
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family=quasipoisson))
summary(pool(repeated.analysis))
##                    est         se         t        df Pr(>|t|)      lo 95
## (Intercept)  1.0075246 0.01613825  62.43084 4358.0686        0  0.9758854
## x1           0.7450712 0.01338057  55.68306  163.5364        0  0.7186503
## x2          -0.2469628 0.01274407 -19.37865  493.9668        0 -0.2720021
## x3           0.4555993 0.01448504  31.45310   38.6698        0  0.4262926
##                  hi 95 nmis        fmi     lambda
## (Intercept)  1.0391638   NA 0.01086766 0.01041384
## x1           0.7714922    0 0.16350855 0.15334069
## x2          -0.2219235    0 0.08866686 0.08498446
## x3           0.4849061    0 0.35243277 0.31978490

4.6 Imputation based on a negative binomial model

Method "nb" fits a negative binomial model using function glm.nb() from package MASS and simulates the imputations from a negative binomial distribution, using function rnegbin() from package MASS.

The imputation process works analogously to Section 4.4.1.
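A minimal sketch of this approach on simulated data (illustrative names; not the package code):

```r
library(MASS)   # glm.nb, rnegbin

set.seed(7)
n  <- 500
x1 <- rnorm(n)
y  <- rnegbin(n, mu = exp(1 + .5 * x1), theta = 2)   # overdispersed counts
ry <- runif(n) > .25                                 # response indicator
dat <- data.frame(y, x1)

## fit the NB model to the observed part, then simulate imputations
## from the fitted negative binomial distribution
fit <- glm.nb(y ~ x1, data = dat, subset = ry)
mu  <- predict(fit, newdata = dat[!ry, , drop = FALSE], type = "response")
imp <- rnegbin(sum(!ry), mu = mu, theta = fit$theta)
```

Unlike the quasi-Poisson approach, the overdispersion parameter \(\theta\) is estimated directly by glm.nb() and can be passed straight to rnegbin().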

4.7 Imputation of count data with an excess number of zero counts

Methods "zip" and "zinb" impute missing data based on a zero-inflated Poisson or NB model respectively (for details, see Kleinke & Reinecke, 2013). Methods "hp" and "hnb" use a hurdle Poisson or NB model.

Users can specify different sets of predictors for the zero and count parts of the models. Allowed entries in the predictorMatrix are listed in Table 4.2.

Table 4.2: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in either model
1 variable is included in both the zero and the count model
2 count model predictor
3 zero model predictor

Note that the current mice version does not allow entries in the predictorMatrix other than 0 and 1, unless the model is a multilevel model. Two-level imputation models in mice have the prefix 2l. Kleinke & Reinecke (2013) therefore used this workaround and labeled the functions 2l.zip and 2l.zinb – these were included in countimp version 1 to allow the specification of different sets of predictors for the zero and count parts of the model via the predictorMatrix. This, however, led to some confusion among empirical researchers, since these functions did not fit two-level models.

Caution: In version 2 of package countimp, methods 2l.zip and 2l.zinb in fact fit two-level models – see Section 4.10.

In version 2 of package countimp, wrapper function countimp() takes care of this problem. It is therefore important that users call countimp() rather than mice() to create the imputations when the imputation model is either a zero-inflation model or a hurdle model. countimp() takes the same arguments as mice() and passes them on to mice; the imputations are then created by mice.
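For illustration, a specification following Table 4.2 might look like this. The incomplete data set MZIP and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical incomplete data set MZIP with variables y, x1, x2, x3:
## let x1 and x2 predict the count part and x3 the zero part of the model.
vars <- c("y", "x1", "x2", "x3")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", c("x1", "x2")] <- 2   # count model predictors (code 2)
pred["y", "x3"]          <- 3   # zero model predictor (code 3)
## imp <- countimp(MZIP, method = c("zip", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```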

4.7.1 Monte Carlo Simulation

Kleinke & Reinecke (2013) have tested the zero-inflation functions (methods "2l.zip" and "2l.zinb") in various scenarios and found that they performed adequately well when the empirical data were in fact zero-inflated Poisson or negative binomial. Here, we use a highly similar set-up to evaluate the new hurdle model variants of these functions, which are available in version 2 of the countimp package.

Note that unlike zero-inflation models, hurdle models allow only one source of zeros. The zero model determines whether case \(i\) has a zero or non-zero observation, and the count model determines which non-zero observation case \(i\) has. Hurdle models assume that the residuals of the hurdle component (here the zero model) and the outcome model (here the count model) are uncorrelated. This assumption is plausible if quite distinct groups form the zero versus the non-zero category.

An example where a hurdle model might be an appropriate analysis model is the number of cigarettes smoked per day. Non-smokers will quite certainly have a zero count, while smokers will vary in their number of cigarettes smoked per day. Furthermore, one set of variables might predict whether a person is a smoker or a non-smoker, while another set predicts how many cigarettes a smoker consumes per day.

To test the new hurdle model imputation functions, we created 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) respectively that contained four variables: \(y\) the dependent count variable that has an excess number of zeros, and continuous predictors \(x1\), \(x2\), and \(x3\).

\(x_1\)–\(x_3\) were drawn from normal \(\mathcal{N}(0,1)\) populations. \(\mu_{0i}\) was the individual probability of belonging to the count versus the zero group and was determined by a logit model

\[\begin{equation} \mu_{0i}=\text{invlogit}(2\cdot x_3) \end{equation}\]

The respective observation in \(y\) was drawn from a binomial distribution with size 1 and probability \(\mu_{0i}\). Finally, a “1” in \(y\) was replaced by a draw from a zero-truncated Poisson distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) in the Poisson conditions, or from a zero-truncated negative binomial distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) and overdispersion parameter \(\theta=1\) in the negative binomial conditions. MAR missingness was then introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression variants of our hurdle model imputation functions. Function hurdle from package pscl was used to fit the corresponding hurdle model to the completed data sets. Results of the Poisson conditions are displayed in Tables 4.3 and 4.4. \(\beta\) are the coefficients in the count part of the model, \(\gamma\) the coefficients in the zero model. \(Q\) are the simulated true population quantities, \(\widehat{Q}\) the Monte Carlo estimates, and \(SD_{\widehat{Q}}\) is the standard deviation of the estimates across the 1000 replications. We also report %Bias and 95% coverage rates. As can be seen in Tables 4.3 and 4.4, estimates are always very close to the true population quantities, bias is very small, and coverage is adequate for all parameters regardless of sample size.
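Drawing from a zero-truncated Poisson distribution, as in the data-generating step above, can be sketched via inverse-CDF sampling. This is one common approach; the function name is illustrative:

```r
## Zero-truncated Poisson draws: sample the uniform variate above the
## probability mass at zero, then invert the ordinary Poisson CDF.
rztpois <- function(n, lambda) {
  u <- runif(n, min = dpois(0, lambda), max = 1)
  qpois(u, lambda)
}

set.seed(3)
draws <- rztpois(1000, lambda = exp(1))  # every draw is >= 1
```

The zero-truncated NB draws used in the NB conditions can be obtained analogously by replacing dpois/qpois with dnbinom/qnbinom.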

Table 4.3: Performance of MI based on a Bayesian regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.996 0.295 0.294 -0.005 2.028
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.050 0.145 0.228
500 Bias 0.004 0.005 0.006 0.005 -0.028
500 Coverage Rate 96.700 95.300 95.800 95.300 94.900
1000 \(\widehat{Q}\) 0.997 0.297 0.296 -0.003 2.014
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.036 0.101 0.163
1000 Bias 0.003 0.003 0.004 0.003 -0.014
1000 Coverage Rate 94.600 95.200 95.200 95.800 95.400
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.048
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 94.400 94.000 96.100 95.200 94.900
Table 4.4: Performance of MI based on a bootstrap regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.994 0.293 0.293 -0.007 2.037
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.049 0.147 0.234
500 Bias 0.006 0.007 0.007 0.007 -0.037
500 Coverage Rate 95.400 95.300 96.200 95.100 95.000
1000 \(\widehat{Q}\) 0.996 0.295 0.295 -0.005 2.017
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.035 0.101 0.164
1000 Bias 0.004 0.005 0.005 0.005 -0.017
1000 Coverage Rate 94.600 95.200 95.100 95.500 93.800
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.049
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 93.900 94.100 94.400 94.200 95.300
Table 4.5: Performance of MI based on a Bayesian regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.990 0.284 0.290 0.002 2.025
500 \(SD_{\widehat{Q}}\) 0.132 0.095 0.094 0.142 0.230
500 Bias 0.010 0.016 0.010 -0.002 -0.025
500 Coverage Rate 94.900 93.500 94.400 95.200 95.100
1000 \(\widehat{Q}\) 0.992 0.289 0.290 -0.004 2.019
1000 \(SD_{\widehat{Q}}\) 0.098 0.065 0.064 0.100 0.162
1000 Bias 0.008 0.011 0.010 0.004 -0.019
1000 Coverage Rate 94.400 94.400 93.800 95.900 96.400
10000 \(\widehat{Q}\) 1.000 0.298 0.299 0.001 2.003
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.021 0.032 0.050
10000 Bias 0.000 0.002 0.001 -0.001 -0.003
10000 Coverage Rate 94.200 95.000 95.500 93.900 94.700
Table 4.6: Performance of MI based on a bootstrap regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.992 0.280 0.288 0.002 2.039
500 \(SD_{\widehat{Q}}\) 0.134 0.094 0.093 0.145 0.230
500 Bias 0.008 0.020 0.012 -0.002 -0.039
500 Coverage Rate 94.600 92.900 92.800 94.800 94.600
1000 \(\widehat{Q}\) 0.991 0.285 0.289 -0.003 2.024
1000 \(SD_{\widehat{Q}}\) 0.096 0.065 0.066 0.101 0.164
1000 Bias 0.009 0.015 0.011 0.003 -0.024
1000 Coverage Rate 95.400 94.300 94.200 95.800 94.400
10000 \(\widehat{Q}\) 1.000 0.297 0.298 0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.020 0.032 0.050
10000 Bias 0.000 0.003 0.002 -0.001 -0.004
10000 Coverage Rate 93.600 93.800 95.300 95.300 94.600

Results of the NB conditions are displayed in Tables 4.5 and 4.6. Again, parameter estimates are typically accurate and coverage is fine for all parameters. In the small sample size condition with \(N=500\), the overdispersion parameter \(\theta\) was slightly overestimated (1.149 for the Bayesian regression method and 1.187 for the bootstrap regression method respectively).

With \(N=1000\), overdispersion parameter \(\theta\) was 1.082 for the Bayesian regression method, and 1.102 for the bootstrap regression method. When \(N=10000\), overdispersion parameter \(\theta\) was 1.010 for the Bayesian regression method, and 1.013 for the bootstrap regression method. In comparison, the mean estimate for \(\theta\) of the complete data sets – i.e. before any data in \(y\) were deleted – was 1.056.

All in all, results suggest that the imputation methods work sufficiently well.

4.8 Imputation of two-level Poisson distributed data

mice.impute.2l.poisson() imputes missing data based on a generalized linear mixed effects Poisson model. Note that the function requires the data to be in long format, i.e. the repeated measurements are stacked upon one another. Additional time and id variables indicate the measurement time point (when the data are panel data) and cluster membership. The function is described in detail in Kleinke & Reinecke (2015a). To specify the imputation model, we use the same coding scheme as other two-level imputation functions in mice. Allowed entries in the predictorMatrix are given in Table 4.7:

Table 4.7: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable

The intercept is always treated as a fixed and random factor. If this is not desired, the .noint variant of the function can be used, which treats the intercept as a fixed but not as a random effect (see help("mice.impute.2l.poisson")).
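For illustration, a specification following Table 4.7 might look like this. The long-format data set and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical long-format data: y (incomplete count), x (level-1
## predictor), z (level-2 predictor), id (cluster identifier)
vars <- c("y", "x", "z", "id")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", "x"]  <-  2   # fixed and random effect (code 2)
pred["y", "z"]  <-  1   # fixed effect only (code 1)
pred["y", "id"] <- -2   # class variable (code -2)
## imp <- countimp(dat, method = c("2l.poisson", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```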

4.9 Imputation of overdispersed two-level count data

Method 2l.nb imputes incomplete data based on a two-level NB model (for details, see Kleinke & Reinecke, 2015b). In version 2 of package countimp, we have replaced package glmmADMB by glmmTMB to estimate the two-level NB model. In our practical experience, estimation by glmmTMB is faster and more stable. Note that the function requires data in long format and uses the same coding scheme as other two-level imputation functions in mice (cf. van Buuren & Groothuis-Oudshoorn, 2011). The specification of the imputation method and model works analogously to other multilevel imputation functions in mice, see for example help("mice.impute.2l.pan"). Allowed entries in the predictorMatrix are listed in Table 4.8.

Table 4.8: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable

4.9.1 Monte Carlo Simulation

We tested the refurbished version of mice.impute.2l.nb by means of a Monte Carlo simulation (with 200 replications). The design was very similar to the one used by Kleinke & Reinecke (2015b).

We simulated data of \(g=50\) groups of varying group sizes \(n_j\), \(j\in\{1,2,\dots,g\}\).

\(n_j\) were drawn randomly from the integers between 100 and 400. The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), one continuous group level predictor \(z\), and the overdispersed dependent count variable \(y\). The intercept and the slope of \(x\) were treated as random factors (i.e. they were allowed to vary across the 50 groups). We set the following population parameters: \(\beta_0=1\), the fixed intercept term, \(\beta_1=.75\), the coefficient of the individual level predictor \(x\), \(\beta_2=.5\), the coefficient of the group level predictor \(z\), and \(\theta=2\), the overdispersion parameter. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_j\), \(z_j\), and the individual level residuals \(e_j\) were simulated from normal \(\mathcal{N}(0,1)\) populations. Random effects were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows.

We first obtained parameters

\[\begin{equation*} {\beta_{j}}={\Gamma}{Z_{j}}+{u_{j}} \end{equation*}\]

where

\[\begin{equation*} {\beta_{j}}=\begin{bmatrix} {\beta_{0j}} \\ {\beta_{1j}} \end{bmatrix}, {\Gamma} = \begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \\ \end{bmatrix} = \begin{bmatrix} 1 & .5 \\ 0.75 & 0 \\ \end{bmatrix}, {Z_j}=\begin{bmatrix} 1 \\ z_{j} \end{bmatrix}, \text{and } {u_{j}}=\begin{bmatrix} {u_{0j}} \\ {u_{1j}} \end{bmatrix}. \end{equation*}\]

For each individual \(i\), the dependent count variable \(y_{ij}\) was then drawn from \(y_{ij}\sim\text{NB}(\mu_{ij}=\exp(X_{ij}\beta_j),\theta=2)\), using the rnegbin() function from R package MASS, where \(X_{ij}=\begin{bmatrix}1&x_{ij}\end{bmatrix}\) .

In the next step, we introduced missing data in \(y\). Predictors \(x\), and \(z\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+z_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.2% (\(SD=3.0\%\)).
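The missingness rule above can be sketched as follows. plogis() is R's inverse logit; the outcome vector here is a simple stand-in rather than the simulated two-level data:

```r
set.seed(11)
n <- 10000
z <- rnorm(n)                  # fully observed covariate
y <- rpois(n, lambda = 3)      # stand-in for the simulated outcome
p <- plogis(-1 + z)            # missingness probability invlogit(-1 + z)
v <- runif(n)                  # uniform random draws
y[v < p] <- NA                 # set y to NA where v < p
mean(is.na(y))                 # close to .30 on average
```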

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and bootstrap regression versions of the function. The imputation model included variables \(x\) and \(z\). The intercept and the slope of \(x\) were treated as random factors. We used the mice default settings and monitored convergence as outlined in van Buuren & Groothuis-Oudshoorn (2011). The data sets were then analyzed by the same NB2 model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987). Results are displayed in Tables 4.9 and 4.10. The tables display (a) the average parameter estimate across the 200 replications, (b) bias, which is defined as the difference between the true parameter and the average parameter estimate, (c) %Bias, which is bias divided by the true parameter, multiplied by 100%, and (d) 95% coverage (i.e. the percentage of confidence intervals that include the true parameter).

Table 4.9: Performance of MI based on a Bayesian regression two-level NB2 model
Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_{00}\) \(\sigma_{11}\) \(\sigma_{01}\) \(\theta\)
Q 1.000 0.750 0.500 0.500 0.300 0.000 2.000
\(\widehat{Q}\) 0.996 0.748 0.503 0.471 0.294 -0.088 1.961
Bias 0.004 0.002 -0.003 0.029 0.006 0.088 0.039
%Bias 0.400 0.267 -0.600 5.800 2.000 Inf 1.950
Coverage Rate 93.500 95.500 94.500 NA NA NA NA
Table 4.10: Performance of MI based on a bootstrap regression two-level NB2 model
Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_{00}\) \(\sigma_{11}\) \(\sigma_{01}\) \(\theta\)
Q 1.000 0.750 0.500 0.500 0.300 0.000 2.000
\(\widehat{Q}\) 0.996 0.748 0.504 0.472 0.295 -0.102 1.977
Bias 0.004 0.002 -0.004 0.028 0.005 0.102 0.023
%Bias 0.400 0.267 -0.800 5.600 1.667 Inf 1.150
Coverage Rate 94.000 95.500 93.500 NA NA NA NA

As can be seen, the average parameter estimates are close to the defined population quantities, %Bias is below 10%, and coverage is above 90% for all parameters, which means that the imputation function works nicely when, as in this case, model assumptions are met.

4.10 Imputation of zero-inflated two-level count data

Methods "2l.hp" and "2l.hnb" impute zero-inflated two-level count data based on a two-level hurdle model. Analogously, methods "2l.zip" and "2l.zinb" create imputations based on a two-level zero-inflated Poisson or NB model respectively. The .noint variants of the functions treat the intercept only as a fixed, but not as a random effect. It may be specified, if the intercept is excluded from the random part of the zero model (.noint.zero), the random part of the count model (.noint.count), or from the random part of both models (.noint.both). For details about all model variants, see help("mice.impute.2l.hnb"

Different sets of predictors can be specified for the zero and count parts of the model. Allowed entries in the predictorMatrix are listed in Table 4.11:

Table 4.11: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
3 variable is included as a fixed effect (count model only)
4 variable is included as a fixed and random effect (count model only)
5 variable is included as a fixed effect (zero model only)
6 variable is included as a fixed and random effect (zero model only)
-2 class variable
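For illustration, a specification following Table 4.11 might look like this. The long-format data set and its variable layout are hypothetical, so the countimp() call is shown commented out:

```r
## Hypothetical long-format data: y (incomplete count), x, w, id.
## x enters both model parts as a fixed and random effect (code 2);
## w enters only the zero model as a fixed effect (code 5).
vars <- c("y", "x", "w", "id")
pred <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vars, vars))
pred["y", "x"]  <-  2
pred["y", "w"]  <-  5
pred["y", "id"] <- -2   # class variable
## imp <- countimp(dat, method = c("2l.hnb", "", "", ""),
##                 predictorMatrix = pred, print = FALSE)
```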

4.10.1 Monte Carlo Simulation

We evaluated these new or updated imputation procedures for zero-inflated two-level data by means of Monte Carlo simulation (with 200 replications). The design was similar to the one used by Kleinke & Reinecke (2015b): We simulated data of \(g=50\) groups of sample sizes \(n_j=100\), \(j\in\{1,\dots,g\}\). The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), and the overdispersed dependent count variable \(y\). \(x\) influenced both the zero and the count process. The intercept and the slope of \(x\) were treated as random in both the zero and the count part of the model. We set the following population parameters: \(\beta_0=1\) and \(\beta_1=.75\), the fixed intercept and slope terms of the count part of the model, and \(\gamma_0=0\) and \(\gamma_1=.5\), the fixed intercept and slope terms in the zero part of the model. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_{ij}\) were simulated from a normal \(\mathcal{N}(0,1)\) population. Random effects \(u\) were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows. We first obtained parameters for the zero part and the count part of the model. For each group \(j\), \(j\in\{1,\dots,g\}\), we simulated parameters:

\[\begin{align*} \gamma_{0j}&=0+u_{0jz}\\ \gamma_{1j}&=.5+u_{1jz}\\ \beta_{0j}&=1+u_{0jc}\\ \beta_{1j}&=.75+u_{1jc} \end{align*}\]

The probability of belonging to the zero versus non-zero category was then determined by

\[\begin{equation*} \mu_{ij}=\text{invlogit}(\gamma_{0j}+\gamma_{1j}x_{1i}) \end{equation*}\]

We then drew random uniform \(z\sim\mathcal{U}(0,1)\) numbers and if \(z_{ij}<\mu_{ij}\) then the dependent variable \(y\) was set to 0, and 1 otherwise. In the next step, a nonzero observation in \(y\) was replaced by a random draw from a zero-truncated Poisson distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{1ij})\) in the Poisson condition, and by a random draw from a zero-truncated NB distribution with mean \(\exp(\beta_{0j}+\beta_{1j} x_{1ij})\) and overdispersion parameter \(\theta=2\) in the NB condition.

Then, we introduced missing data in \(y\). Predictor \(x\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+x_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.4%.

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and bootstrap regression versions of the respective function (hurdle Poisson imputation in the Poisson condition, and hurdle NB imputation in the negative binomial condition). \(x\) was included as predictor for the zero and count part of the model. The intercept and the slope of \(x\) were treated as random factors (again in both parts of the model). We used the mice default settings and monitored convergence (for details, see Chapter 5). The \(m\) completed data sets were then analyzed by the same model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987).

Results of hurdle Poisson imputation are displayed in Tables 4.12 and 4.13, results of hurdle NB imputation are displayed in Tables 4.14 and 4.15. The tables display the average parameter estimate across the 200 replications, the standard deviation of the estimates across the replications, bias, percent bias and 95% coverage.

Table 4.12: Performance of MI based on a Bayesian regression two-level hurdle Poisson model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 1.001 -0.001 -0.100 94.924
\(\beta_1\) 0.75 0.746 0.004 0.533 93.909
\(\gamma_0\) 0.00 -0.006 0.006 Inf 92.893
\(\gamma_1\) 0.50 0.492 0.008 1.600 96.447
\(\sigma_{00c}\) 0.50 0.491 0.009 1.800 NA
\(\sigma_{11c}\) 0.30 0.283 0.017 5.667 NA
\(\sigma_{01c}\) 0.00 -0.008 0.008 Inf NA
\(\sigma_{00z}\) 0.50 0.477 0.023 4.600 NA
\(\sigma_{11z}\) 0.30 0.271 0.029 9.667 NA
\(\sigma_{01z}\) 0.00 -0.053 0.053 Inf NA
Table 4.13: Performance of MI based on a bootstrap regression two-level hurdle Poisson model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 1.011 -0.011 -1.100 94.416
\(\beta_1\) 0.75 0.748 0.002 0.267 96.447
\(\gamma_0\) 0.00 -0.006 0.006 Inf 91.878
\(\gamma_1\) 0.50 0.495 0.005 1.000 91.878
\(\sigma_{00c}\) 0.50 0.446 0.054 10.800 NA
\(\sigma_{11c}\) 0.30 0.261 0.039 13.000 NA
\(\sigma_{01c}\) 0.00 -0.127 0.127 Inf NA
\(\sigma_{00z}\) 0.50 0.439 0.061 12.200 NA
\(\sigma_{11z}\) 0.30 0.269 0.031 10.333 NA
\(\sigma_{01z}\) 0.00 -0.108 0.108 Inf NA
Table 4.14: Performance of MI based on a Bayesian regression two-level hurdle NB model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 0.993 0.007 0.700 93.939
\(\beta_1\) 0.75 0.735 0.015 2.000 92.424
\(\gamma_0\) 0.00 -0.003 0.003 Inf 93.939
\(\gamma_1\) 0.50 0.497 0.003 0.600 96.970
\(\sigma_{00c}\) 0.50 0.480 0.020 4.000 NA
\(\sigma_{11c}\) 0.30 0.276 0.024 8.000 NA
\(\sigma_{01c}\) 0.00 -0.059 0.059 Inf NA
\(\sigma_{00z}\) 0.50 0.485 0.015 3.000 NA
\(\sigma_{11z}\) 0.30 0.274 0.026 8.667 NA
\(\sigma_{01z}\) 0.00 -0.054 0.054 Inf NA
\(\theta\) 2.00 2.098 -0.098 -4.900 NA
Table 4.15: Performance of MI based on a bootstrap regression two-level hurdle NB model
\(Q\) \(\widehat{Q}\) Bias %Bias Coverage Rate
\(\beta_0\) 1.00 0.996 0.004 0.400 92.929
\(\beta_1\) 0.75 0.743 0.007 0.933 93.939
\(\gamma_0\) 0.00 -0.001 0.001 Inf 93.434
\(\gamma_1\) 0.50 0.500 0.000 0.000 95.455
\(\sigma_{00c}\) 0.50 0.440 0.060 12.000 NA
\(\sigma_{11c}\) 0.30 0.262 0.038 12.667 NA
\(\sigma_{01c}\) 0.00 -0.151 0.151 Inf NA
\(\sigma_{00z}\) 0.50 0.444 0.056 11.200 NA
\(\sigma_{11z}\) 0.30 0.268 0.032 10.667 NA
\(\sigma_{01z}\) 0.00 -0.127 0.127 Inf NA
\(\theta\) 2.00 1.991 0.009 0.450 NA

Estimates are sufficiently accurate. It appears that the bootstrap variants performed slightly worse than the Bayesian variants, especially regarding the random part of the model. This might be caused by the block bootstrap approach implemented here (we re-sample clusters rather than individual cases), which requires a substantial number of clusters. Future research needs to establish under which scenarios it is better to use the Bayesian variants of the functions and when to use the bootstrap variants. Little is yet known about level-1 and level-2 sample size requirements.