# Chapter 4 `countimp`

## 4.1 A cautionary note

Package `countimp` creates imputations based on parametric count models. The *plausibility* of the obtained imputations depends upon the correct specification of the imputation procedure, i.e. the count data regression model that is used to predict missing information must fit the empirical data well.

Research by Kleinke & Reinecke (2013) suggests that misspecifications of the (count) imputation model can bias statistical inferences quite noticeably. We therefore advise researchers to spend sufficient time on modelling. We would like to refer readers to the excellent book by Hilbe (2011) to determine which count model fits the data best.

We would also like to draw attention to the R packages `Qtools` and `ImputeRobust`, which create imputations based on nonparametric quantile regression and semi-parametric generalized additive models respectively. These can be suitable alternatives in situations where the assumptions of parametric count models are violated.

## 4.2 Count data models

An overview of parametric imputation models available in package `countimp`

is given in Table 4.1. More details on the theoretical background of count regression models is given in Zeileis et al. (2008), and in Hilbe (2011).

Function Name | model / family |
---|---|
`mice.impute.poisson()` | Poisson |
`mice.impute.quasipoisson()` | Quasi-Poisson |
`mice.impute.nb()` | negative binomial |
`mice.impute.zip()` | zero-inflated Poisson |
`mice.impute.zinb()` | zero-inflated negative binomial |
`mice.impute.hp()` | hurdle Poisson |
`mice.impute.hnb()` | hurdle negative binomial |
`mice.impute.2l.poisson()` | two-level Poisson |
`mice.impute.2l.nb()` | two-level negative binomial |
`mice.impute.2l.zip()` | two-level zero-inflated Poisson |
`mice.impute.2l.zinb()` | two-level zero-inflated NB |
`mice.impute.2l.hp()` | two-level hurdle Poisson |
`mice.impute.2l.hnb()` | two-level hurdle negative binomial |
## 4.3 Installation

The latest version can be installed from GitHub using function `install_git` from package `devtools`:

```
devtools::install_git(url = "https://github.com/kkleinke/countimp",
branch = "master")
```

## 4.4 Imputation of incomplete Poisson distributed variables

The function to impute an incomplete Poisson distributed variable is called `mice.impute.poisson()`. Appending `.boot` to the function name calls the bootstrap regression variant (see `help("mice.impute.poisson.boot")`). The functions have the following arguments:

```
library(countimp)
args("mice.impute.poisson")
```

```
## function (y, ry, x, wy = NULL, EV = TRUE, ...)
## NULL
```

Arguments `y`, `ry`, and `x` need not be specified; they are automatically obtained from `mice`. For information on how to use `mice`’s new `wy` argument, and for further information, see `help("mice")`.

Numeric vector `y` is the incomplete Poisson distributed variable that shall be imputed. `ry` is the response indicator of `y`, with `TRUE` indicating an observed and `FALSE` a missing observation in `y`. `wy` is a logical vector of `length(y)`; `TRUE` entries indicate locations in `y` for which imputations shall be generated (the default is `!ry`). `x` is a matrix with `length(y)` rows containing the complete covariates – the variables in the imputation model, which are specified via `mice`’s `predictorMatrix` argument. The variables in `x` are used to predict the missing information in `y`. Argument `EV` defines how the function handles extreme imputations: when set to `TRUE`, the function automatically detects extreme values among the imputations and replaces them by values obtained by predictive mean matching.

We use the standard `glm.fit` function from package `stats` to fit the Poisson model to the observed part of the data, i.e. `x[ry,]` and `y[ry]`. Imputations are drawn from a Poisson distribution with mean \(\lambda\), which is the value predicted for the respective incomplete case.
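The sampling step just described can be sketched in a few lines. This is a simplified illustration with made-up data, not the package's exact implementation (which, for the Bayesian variant, additionally draws the regression coefficients from their approximate posterior):

```
## Sketch: fit a Poisson GLM to the observed part, then draw imputations
## from a Poisson distribution with the predicted means of the missing cases.
set.seed(1)
n  <- 200
x  <- rnorm(n)
y  <- rpois(n, exp(1 + 0.5 * x))
ry <- runif(n) > 0.3                              # TRUE = observed
fit <- glm(y[ry] ~ x[ry], family = poisson)
lam <- exp(coef(fit)[1] + coef(fit)[2] * x[!ry])  # predicted means (lambda)
imp <- rpois(sum(!ry), lam)                       # imputations for missing y
```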

### 4.4.1 Example

We generate an example data set with a Poisson distributed dependent variable \(y\) and three continuous predictors \(x1\)–\(x3\) and introduce MAR missingness in the dependent variable.

```
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
lam <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- rpois( N, lam )
POIS <- data.frame( y, x1, x2, x3 )
## introduce MAR missingness to simulated data
generate.md <- function( data, pos = 1, Z = 2, pmis = .5, strength = c( .5, .5 ) )
{
total <- round( pmis * nrow(data) )
sm <- which( data[,Z] < mean( data[,Z] ) )
gr <- which( data[,Z] > mean( data[,Z] ) )
sel.sm <- sample( sm, round( strength[1] * total ) )
sel.gr <- sample( gr, round( strength[2] * total ) )
sel <- c( sel.sm, sel.gr )
data[sel,pos] <- NA
return(data)
}
MPOIS <- generate.md( POIS, pmis = .2, strength = c( .2, .8) )
```

This procedure generates 20% missing data in `y`. Missingness depends on variable `x1`: cases whose `x1` value is above the mean of `x1` are more likely to have a missing value in `y`. The resulting incomplete data set is called `MPOIS`.

Imputations are then generated using wrapper function `countimp()`, which passes the arguments on to `mice`. To select function `mice.impute.poisson()` to impute the missing data in \(y\), the respective slot in `method` has to be set to `"poisson"`. See `help("mice")` for details.

```
colnames(MPOIS)
```

```
## [1] "y" "x1" "x2" "x3"
```

`MPOIS` has four variables. Thus, `method` has to be a vector of length 4. `y` is the first variable in `MPOIS`, so we set the first entry in `method` to `"poisson"` (or `"poisson.boot"` for the bootstrap variant). The other variables are completely observed and get method `""`.

```
imp <- countimp( MPOIS, method = c( "poisson" ,"" ,"" ,"" ), print=FALSE)
impb <- countimp( MPOIS, method = c( "poisson.boot" ,"" ,"" ,"" ), print=FALSE)
```

Note that if `predictorMatrix` is not specified, all other variables are used to predict the missing information in `y`.

The resulting object `imp` is of class `mids` (a multiply imputed data set). Standard `mice` functions like `with()` and `pool()` can be applied to this object to automate repeated data analysis and the pooling of results.

```
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family = poisson))
summary(pool(repeated.analysis))
```

```
## est se t df Pr(>|t|)
## (Intercept) 0.9779894 0.009909391 98.69320 700.87521 0
## x1 0.7507701 0.007478281 100.39341 1030.71997 0
## x2 -0.2342093 0.008082908 -28.97588 91.46838 0
## x3 0.4846562 0.008227717 58.90531 55.51060 0
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 0.9585338 0.9974451 NA 0.07225868 0.06961507
## x1 0.7360957 0.7654445 0 0.05690214 0.05507394
## x2 -0.2502639 -0.2181547 0 0.22348656 0.20669134
## x3 0.4681709 0.5011415 0 0.29147133 0.26639537
```

For details about the respective slots of the output, see van Buuren & Groothuis-Oudshoorn (2011).

## 4.5 Imputation of overdispersed count data

`countimp` has two methods to impute overdispersed count data:

- imputation based on a quasi-Poisson model, and
- imputation based on a negative binomial model.

Both methods can be used when the equidispersion assumption of the standard Poisson model is violated, i.e. when the variance of the count variable is larger than its mean.

### 4.5.1 Imputation based on a quasi-Poisson model

Method `"quasipoisson"` imputes univariate missing data based on a Bayesian quasi-Poisson GLM, fitted by the standard R function `glm.fit` with `family = quasipoisson(link = log)`. Imputations are simulated from a negative binomial distribution to ensure an adequate dispersion of the imputed values: the quasi-Poisson model estimates the Pearson dispersion \(\delta\), which quantifies the amount of overdispersion in the data. The Pearson dispersion \(\delta\) is stored in slot `$dispersion` of the output of the `summary.glm` function. We supply this parameter to the `rnbinom()` function from package `stats` to obtain the imputations. For details about the parametrization of the NB distribution used by `rnbinom`, see `help("rnbinom")`.
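One way to translate the estimated Pearson dispersion into negative binomial draws is to match the quasi-Poisson variance \(\delta\mu\) to the NB2 variance \(\mu+\mu^2/\texttt{size}\), which gives \(\texttt{size}=\mu/(\delta-1)\). The sketch below illustrates this idea on simulated data; this mapping is our illustrative assumption, and the exact parametrization used inside `countimp` may differ:

```
## Illustrative sketch (assumed size = mu / (delta - 1) parametrization):
set.seed(1)
x   <- rnorm(500)
y   <- MASS::rnegbin(500, mu = exp(1 + 0.5 * x), theta = 2)  # overdispersed
fit <- glm(y ~ x, family = quasipoisson(link = log))
delta <- summary(fit)$dispersion        # Pearson dispersion (> 1 here)
mu   <- fitted(fit)[1:5]                # predicted means, first 5 cases
size <- mu / (delta - 1)                # match var = delta * mu to NB2
imp  <- rnbinom(5, mu = mu, size = size)
```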

### 4.5.2 Example

We simulate an overdispersed count variable `y` and generate about 20% missing data in this variable. Imputation, repeated data analysis and the pooling of results work analogously to Section 4.4.1.

```
## simulate overdispersed count data
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
mu <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- MASS::rnegbin( N, theta = 2, mu )
NB <- data.frame( y, x1, x2, x3 )
## introduce MAR missingness to simulated data
total <- round( .2 * N ) ##number of missing data in y
sm <- which( NB[,2] < mean( NB[,2] ) ) ##subset: cases with x2<mean(x2)
gr <- which( NB[,2] > mean( NB[,2] ) ) ##subset: cases with x2>mean(x2)
sel.sm <- sample( sm, round( .2 * total ) ) ##select cases to set as missing
sel.gr <- sample( gr, round( .8 * total ) ) ##select cases to set as missing
sel <- c( sel.sm,sel.gr )
MNB <- NB
MNB[sel,1] <- NA ##delete selected data
## impute missing data
imp <- countimp( MNB, method = c( "quasipoisson", "", "", "" ), print = FALSE)
## repeated data analysis and pooling of results
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family=quasipoisson))
summary(pool(repeated.analysis))
```

```
## est se t df Pr(>|t|) lo 95
## (Intercept) 1.0075246 0.01613825 62.43084 4358.0686 0 0.9758854
## x1 0.7450712 0.01338057 55.68306 163.5364 0 0.7186503
## x2 -0.2469628 0.01274407 -19.37865 493.9668 0 -0.2720021
## x3 0.4555993 0.01448504 31.45310 38.6698 0 0.4262926
## hi 95 nmis fmi lambda
## (Intercept) 1.0391638 NA 0.01086766 0.01041384
## x1 0.7714922 0 0.16350855 0.15334069
## x2 -0.2219235 0 0.08866686 0.08498446
## x3 0.4849061 0 0.35243277 0.31978490
```

## 4.6 Imputation based on a negative binomial model

Method `"nb"` fits a negative binomial model using function `glm.nb()` from package `MASS` and simulates the imputations from a negative binomial distribution, using function `rnegbin()` from the same package.

The imputation process works analogously to Section 4.4.1.
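As a sketch, the `"nb"` sampling step might look like this (a simplified illustration on made-up data using `glm.nb()` and `rnegbin()` from `MASS`, not the package's exact code):

```
## Sketch of negative binomial imputation: fit glm.nb() to the observed
## part, then draw imputations with rnegbin() at the predicted means.
library(MASS)
set.seed(1)
n   <- 500
x   <- rnorm(n)
y   <- rnegbin(n, mu = exp(1 + 0.5 * x), theta = 2)
ry  <- runif(n) > 0.2                               # TRUE = observed
fit <- glm.nb(y[ry] ~ x[ry])
mu  <- exp(coef(fit)[1] + coef(fit)[2] * x[!ry])    # predicted means
imp <- rnegbin(sum(!ry), mu = mu, theta = fit$theta)
```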

## 4.7 Imputation of count data with an excess number of zero counts

Methods `zip` and `zinb` impute missing data based on a zero-inflated Poisson or NB model respectively (for details, see Kleinke & Reinecke, 2013). Methods `hp` and `hnb` use a hurdle Poisson or NB model. Users can specify different sets of predictors for the zero and count parts of the models. Allowed entries in the `predictorMatrix` are listed in Table 4.2.

numeric code | meaning |
---|---|
0 | variable is not included in either model |
1 | variable is included in both the zero and the count model |
2 | count model predictor |
3 | zero model predictor |

Note that the current `mice` version does not allow entries in the `predictorMatrix` other than 0 and 1, unless the model is a multilevel model. Two-level imputation models in `mice` have prefix `2l.`. Kleinke & Reinecke (2013) therefore used this workaround and labeled the functions `2l.zip` and `2l.zinb` – these were included in `countimp` version 1 to allow the specification of different sets of predictors for the zero and count parts of the model via the `predictorMatrix`. This, however, led to some confusion among empirical researchers, since these functions did not fit two-level models.

Caution: in version 2 of package `countimp`, methods `2l.zip` and `2l.zinb` do in fact fit two-level models – see Section 4.10.

In version 2 of package `countimp`, we have found a better solution: wrapper function `countimp()` takes care of the problem. It is therefore important that users call `countimp()` rather than `mice()` to create the imputations when the imputation model is a zero-inflation or a hurdle model. `countimp()` takes the same arguments as `mice()` and passes them on to `mice`, which then creates the imputations.

### 4.7.1 Monte Carlo Simulation

Kleinke & Reinecke (2013) tested the zero-inflation functions (methods `"2l.zip"` and `"2l.zinb"`) in various scenarios and found that the methods performed adequately well when the empirical data were in fact zero-inflated Poisson or negative binomial. Here, we use a highly similar set-up to evaluate the new hurdle model variants of these functions, which are available in version 2 of the `countimp` package.

Note that unlike zero-inflation models, hurdle models allow only one source of zeros. The zero model determines whether case \(i\) has a zero or a non-zero observation, and the count model determines which non-zero observation case \(i\) has. Hurdle models assume that the residuals of the hurdle component (here the zero model) and of the outcome model (here the count model) are uncorrelated. This assumption is plausible if quite distinct groups form the zero and the non-zero category.

An example where a hurdle model might be an appropriate analysis model is the number of cigarettes smoked per day. *Non-smokers* will quite certainly have a zero count, while *smokers* will vary in their number of cigarettes smoked per day. Furthermore, one set of variables might predict whether persons are smokers or non-smokers, while other variables predict how many cigarettes a smoker consumes per day.

To test the new hurdle model imputation functions, we created 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) respectively that contained four variables: \(y\) the dependent count variable that has an excess number of zeros, and continuous predictors \(x1\), \(x2\), and \(x3\).

\(x_1\)–\(x_3\) were drawn from normal \(\mathcal{N}(0,1)\) populations. \(\mu_{0i}\) was the individual probability of belonging to the count versus the zero group and was determined by a logit model

\[\begin{equation} \mu_{0i}=\text{invlogit}(2\cdot x_3) \end{equation}\]

The respective observation in \(y\) was drawn from a binomial distribution with size 1 and probability \(\mu_{0i}\). Finally, a “1” in \(y\) was replaced by a draw from a zero-truncated Poisson distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) in the Poisson conditions, or from a zero-truncated negative binomial distribution with the same mean and overdispersion parameter \(\theta=1\) in the negative binomial conditions. MAR missingness was then introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression variants of our hurdle model imputation functions. Function `hurdle` from package `pscl` was used to fit the corresponding hurdle model to the completed data sets. Results of the Poisson conditions are displayed in Tables 4.3 and 4.4. \(\beta\) are the coefficients of the count part of the model, \(\gamma\) the coefficients of the zero model. \(Q\) are the simulated *true* population quantities, \(\widehat{Q}\) the Monte Carlo estimates, and \(SD_{\widehat{Q}}\) is the standard deviation of the estimates across the 1000 replications. We also report %Bias and 95% coverage rates. As can be seen in Tables 4.3 and 4.4, the estimates are always very close to the true population quantities, bias is very small, and coverage is adequate for all parameters regardless of sample size.
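The data-generating process described above can be sketched as follows for one replication of the Poisson condition (coefficients as given above; the zero-truncated draw is implemented here by simple rejection sampling):

```
## Sketch of one simulated data set in the Poisson condition:
set.seed(1)
N   <- 1000
x1  <- rnorm(N); x2 <- rnorm(N); x3 <- rnorm(N)
mu0 <- plogis(2 * x3)            # invlogit: P(count vs. zero group)
y   <- rbinom(N, 1, mu0)         # 0 = zero group, 1 = count group
## replace the 1s by zero-truncated Poisson draws (rejection sampling)
idx <- which(y == 1)
lam <- exp(1 + .3 * x1[idx] + .3 * x2[idx])
for (k in seq_along(idx)) {
  repeat { d <- rpois(1, lam[k]); if (d > 0) break }
  y[idx[k]] <- d
}
```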

N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.996 | 0.295 | 0.294 | -0.005 | 2.028 |
500 | \(SD_{\widehat{Q}}\) | 0.052 | 0.049 | 0.050 | 0.145 | 0.228 |
500 | Bias | 0.004 | 0.005 | 0.006 | 0.005 | -0.028 |
500 | Coverage Rate | 96.700 | 95.300 | 95.800 | 95.300 | 94.900 |
1000 | \(\widehat{Q}\) | 0.997 | 0.297 | 0.296 | -0.003 | 2.014 |
1000 | \(SD_{\widehat{Q}}\) | 0.039 | 0.034 | 0.036 | 0.101 | 0.163 |
1000 | Bias | 0.003 | 0.003 | 0.004 | 0.003 | -0.014 |
1000 | Coverage Rate | 94.600 | 95.200 | 95.200 | 95.800 | 95.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.299 | 0.300 | -0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.012 | 0.011 | 0.011 | 0.032 | 0.048 |
10000 | Bias | 0.000 | 0.001 | 0.000 | 0.001 | -0.004 |
10000 | Coverage Rate | 94.400 | 94.000 | 96.100 | 95.200 | 94.900 |

N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.994 | 0.293 | 0.293 | -0.007 | 2.037 |
500 | \(SD_{\widehat{Q}}\) | 0.052 | 0.049 | 0.049 | 0.147 | 0.234 |
500 | Bias | 0.006 | 0.007 | 0.007 | 0.007 | -0.037 |
500 | Coverage Rate | 95.400 | 95.300 | 96.200 | 95.100 | 95.000 |
1000 | \(\widehat{Q}\) | 0.996 | 0.295 | 0.295 | -0.005 | 2.017 |
1000 | \(SD_{\widehat{Q}}\) | 0.039 | 0.034 | 0.035 | 0.101 | 0.164 |
1000 | Bias | 0.004 | 0.005 | 0.005 | 0.005 | -0.017 |
1000 | Coverage Rate | 94.600 | 95.200 | 95.100 | 95.500 | 93.800 |
10000 | \(\widehat{Q}\) | 1.000 | 0.299 | 0.300 | -0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.012 | 0.011 | 0.011 | 0.032 | 0.049 |
10000 | Bias | 0.000 | 0.001 | 0.000 | 0.001 | -0.004 |
10000 | Coverage Rate | 93.900 | 94.100 | 94.400 | 94.200 | 95.300 |

N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.990 | 0.284 | 0.290 | 0.002 | 2.025 |
500 | \(SD_{\widehat{Q}}\) | 0.132 | 0.095 | 0.094 | 0.142 | 0.230 |
500 | Bias | 0.010 | 0.016 | 0.010 | -0.002 | -0.025 |
500 | Coverage Rate | 94.900 | 93.500 | 94.400 | 95.200 | 95.100 |
1000 | \(\widehat{Q}\) | 0.992 | 0.289 | 0.290 | -0.004 | 2.019 |
1000 | \(SD_{\widehat{Q}}\) | 0.098 | 0.065 | 0.064 | 0.100 | 0.162 |
1000 | Bias | 0.008 | 0.011 | 0.010 | 0.004 | -0.019 |
1000 | Coverage Rate | 94.400 | 94.400 | 93.800 | 95.900 | 96.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.298 | 0.299 | 0.001 | 2.003 |
10000 | \(SD_{\widehat{Q}}\) | 0.030 | 0.021 | 0.021 | 0.032 | 0.050 |
10000 | Bias | 0.000 | 0.002 | 0.001 | -0.001 | -0.003 |
10000 | Coverage Rate | 94.200 | 95.000 | 95.500 | 93.900 | 94.700 |

N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.992 | 0.280 | 0.288 | 0.002 | 2.039 |
500 | \(SD_{\widehat{Q}}\) | 0.134 | 0.094 | 0.093 | 0.145 | 0.230 |
500 | Bias | 0.008 | 0.020 | 0.012 | -0.002 | -0.039 |
500 | Coverage Rate | 94.600 | 92.900 | 92.800 | 94.800 | 94.600 |
1000 | \(\widehat{Q}\) | 0.991 | 0.285 | 0.289 | -0.003 | 2.024 |
1000 | \(SD_{\widehat{Q}}\) | 0.096 | 0.065 | 0.066 | 0.101 | 0.164 |
1000 | Bias | 0.009 | 0.015 | 0.011 | 0.003 | -0.024 |
1000 | Coverage Rate | 95.400 | 94.300 | 94.200 | 95.800 | 94.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.297 | 0.298 | 0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.030 | 0.021 | 0.020 | 0.032 | 0.050 |
10000 | Bias | 0.000 | 0.003 | 0.002 | -0.001 | -0.004 |
10000 | Coverage Rate | 93.600 | 93.800 | 95.300 | 95.300 | 94.600 |

Results of the NB conditions are displayed in Tables 4.5 and 4.6. Again, the parameter estimates are typically accurate and coverage is adequate for all parameters. In the small sample size condition, with \(N=500\), the overdispersion parameter \(\theta\) was slightly overestimated (1.149 for the Bayesian regression method and 1.187 for the bootstrap regression method respectively).

With \(N=1000\), overdispersion parameter \(\theta\) was 1.082 for the Bayesian regression method, and 1.102 for the bootstrap regression method. When \(N=10000\), overdispersion parameter \(\theta\) was 1.010 for the Bayesian regression method, and 1.013 for the bootstrap regression method. In comparison, the mean estimate for \(\theta\) of the complete data sets – i.e. before any data in \(y\) were deleted – was 1.056.

All in all, results suggest that the imputation methods work sufficiently well.

## 4.8 Imputation of two-level Poisson distributed data

`mice.impute.2l.poisson()` imputes missing data based on a generalized linear mixed effects Poisson model. Note that the function requires the data to be in long format, i.e. the repeated measurements are stacked upon one another, and additional time and id variables indicate the measurement timepoint (when the data are panel data) and the cluster membership. The function is described in detail in Kleinke & Reinecke (2015a). To specify the imputation model, we use the same coding scheme as the other two-level imputation functions in `mice`. Allowed entries in the `predictorMatrix` are given in Table 4.7:

numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
-2 | class variable |

The intercept is always treated as both a fixed and a random factor. If this is not desired, the `.noint` variant of the function can be used, which treats the intercept only as a fixed, but not as a random effect (see `help("mice.impute.2l.poisson")`).

## 4.9 Imputation of overdispersed two-level count data

Method `2l.nb` imputes incomplete data based on a two-level NB model (for details, see Kleinke & Reinecke, 2015b). In version 2 of package `countimp`, we have replaced package `glmmADMB` by `glmmTMB` to estimate the two-level NB model; in our practical experience, estimation by `glmmTMB` is faster and more stable. Note that the function requires data in long format and uses the same coding scheme as the other two-level imputation functions in `mice` (cf. van Buuren & Groothuis-Oudshoorn, 2011). The specification of the imputation method and model works analogously to the other multilevel imputation functions in `mice`; see for example `help("mice.impute.2l.pan")`. Allowed entries in the `predictorMatrix` are listed in Table 4.8.

numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
-2 | class variable |

### 4.9.1 Monte Carlo Simulation

We tested the refurbished version of `mice.impute.2l.nb` by means of a Monte Carlo simulation (with 200 replications). The design was very similar to the one used by Kleinke & Reinecke (2015b).

We simulated data of \(g=50\) groups of varying group sizes \(n_j\), \(j\in (1,2,\dots,g)\). The \(n_j\) were drawn randomly from the integers between 100 and 400. The data of the 50 groups were generated separately and then stacked upon each other to create data sets in *long format*. Each data set consisted of one continuous individual level predictor \(x\), one continuous group level predictor \(z\), and the overdispersed dependent count variable \(y\). The intercept and the slope of \(x\) were treated as random factors (i.e. they were allowed to vary across the 50 groups). We set the following population parameters: \(\beta_0=1\), the fixed intercept term, \(\beta_1=.75\), the coefficient of the individual level predictor \(x\), \(\beta_2=.5\), the coefficient of the group level predictor \(z\), and \(\theta=2\), the overdispersion parameter. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_j\), \(z_j\), and the individual level residuals \(e_j\) were simulated from \(\mathcal{N}(0,1)\) normal populations. Random effects were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows.

We first obtained parameters

\[\begin{equation*} {\beta_{j}}={\Gamma}{Z_{j}}+{u_{j}} \end{equation*}\]

where

\[\begin{equation*} {\beta_{j}}=\begin{bmatrix} {\beta_{0j}} \\ {\beta_{1j}} \end{bmatrix}, {\Gamma} = \begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \\ \end{bmatrix} = \begin{bmatrix} 1 & .5 \\ 0.75 & 0 \\ \end{bmatrix}, {Z_j}=\begin{bmatrix} 1 \\ z_{j} \end{bmatrix}, \text{and } {u_{j}}=\begin{bmatrix} {u_{0j}} \\ {u_{1j}} \end{bmatrix}. \end{equation*}\]

For each individual \(i\), the dependent count variable \(y_{ij}\) was then drawn from \(y_{ij}\sim\text{NB}(\mu_{ij}=\exp(X_{ij}\beta_j),\theta=2)\), using the `rnegbin()` function from R package `MASS`, where \(X_{ij}=\begin{bmatrix}1&x_{ij}\end{bmatrix}\).

In the next step, we introduced missing data in \(y\). Predictors \(x\) and \(z\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+z_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.2% (\(SD=3.0\%\)).
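The missingness rule above can be sketched in a few lines of R, using `plogis()` as the inverse logit (a simplified, single-group illustration):

```
## MAR rule sketch: set y to NA where a uniform draw falls below
## p = invlogit(-1 + z)
set.seed(1)
n <- 10000
z <- rnorm(n)
y <- rpois(n, exp(1))
p <- plogis(-1 + z)    # missingness probability per case
v <- runif(n)
y[v < p] <- NA
mean(is.na(y))         # roughly 30% missing on average
```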

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression versions of the function. The imputation model included the variables \(x\) and \(z\); the intercept and the slope of \(x\) were treated as random factors. We used the `mice` default settings and monitored convergence as outlined in van Buuren & Groothuis-Oudshoorn (2011). The data sets were then analyzed by the same NB2 model that was used for data imputation, and the results were combined using Rubin’s formula (Rubin, 1987). Results are displayed in Tables 4.9 and 4.10. The tables display (a) the average parameter estimate across the 200 replications, (b) bias, defined as the difference between the true parameter and the average parameter estimate, (c) %Bias, which is bias divided by the true parameter, multiplied by 100%, and (d) 95% coverage (i.e. the percentage of confidence intervals that include the true parameter).

Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
---|---|---|---|---|---|---|---|
Q | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
\(\widehat{Q}\) | 0.996 | 0.748 | 0.503 | 0.471 | 0.294 | -0.088 | 1.961 |
Bias | 0.004 | 0.002 | -0.003 | 0.029 | 0.006 | 0.088 | 0.039 |
%Bias | 0.400 | 0.267 | -0.600 | 5.800 | 2.000 | Inf | 1.950 |
Coverage Rate | 93.500 | 95.500 | 94.500 | NA | NA | NA | NA |

Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
---|---|---|---|---|---|---|---|
Q | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
\(\widehat{Q}\) | 0.996 | 0.748 | 0.504 | 0.472 | 0.295 | -0.102 | 1.977 |
Bias | 0.004 | 0.002 | -0.004 | 0.028 | 0.005 | 0.102 | 0.023 |
%Bias | 0.400 | 0.267 | -0.800 | 5.600 | 1.667 | Inf | 1.150 |
Coverage Rate | 94.000 | 95.500 | 93.500 | NA | NA | NA | NA |

As can be seen, the average parameter estimates are close to the defined population quantities, %Bias is below 10%, and coverage is above 90% for all parameters, which means that the imputation function works well when, as in this case, the model assumptions are met.

## 4.10 Imputation of zero-inflated two-level count data

Methods `"2l.hp"` and `"2l.hnb"` impute zero-inflated two-level count data based on a two-level hurdle model. Analogously, methods `"2l.zip"` and `"2l.zinb"` create imputations based on a two-level zero-inflated Poisson or NB model respectively. The `.noint` variants of the functions treat the intercept only as a fixed, but not as a random effect. Users can specify whether the intercept is excluded from the random part of the zero model (`.noint.zero`), from the random part of the count model (`.noint.count`), or from the random part of both models (`.noint.both`). For details about all model variants, see `help("mice.impute.2l.hnb")`.

Different sets of predictors can be specified for the zero and count parts of the model. Allowed entries in the `predictorMatrix` are listed in Table 4.11:

numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
3 | variable is included as a fixed effect (count model only) |
4 | variable is included as a fixed and random effect (count model only) |
5 | variable is included as a fixed effect (zero model only) |
6 | variable is included as a fixed and random effect (zero model only) |
-2 | class variable |
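For illustration, suppose a long-format data set with columns `y`, `x`, `w`, and `id` (hypothetical names), where `x` should enter both model parts as a fixed and random effect and `w` should be a fixed-effect predictor of the zero model only. The setup could look like this:

```
## Hypothetical predictorMatrix for two-level hurdle NB imputation:
vars <- c("y", "x", "w", "id")
pred <- matrix(0, 4, 4, dimnames = list(vars, vars))
pred["y", "x"]  <-  2   # fixed + random effect in both model parts
pred["y", "w"]  <-  5   # fixed effect, zero model only
pred["y", "id"] <- -2   # class (cluster) variable
meth <- c("2l.hnb", "", "", "")
## imp <- countimp(data, method = meth, predictorMatrix = pred)
```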

### 4.10.1 Monte Carlo Simulation

We evaluated these new or updated imputation procedures for zero-inflated two-level data by means of a Monte Carlo simulation (with 200 replications). The design was similar to the one used by Kleinke & Reinecke (2015b): We simulated data of \(g=50\) groups of sample sizes \(n_j=100\), \(j\in\{1,\dots,g\}\). The data of the 50 groups were generated separately and then stacked upon each other to create data sets in *long format*. Each data set consisted of one continuous individual level predictor \(x\) and the overdispersed dependent count variable \(y\). \(x\) influenced both the zero and the count process. The intercept and the slope of \(x\) were treated as *random* in both the zero and the count part of the model. We set the following population parameters: \(\beta_0=1\) and \(\beta_1=.75\), the fixed intercept and slope terms of the count part of the model, and \(\gamma_0=0\) and \(\gamma_1=.5\), the fixed intercept and slope terms of the zero part of the model. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_{ij}\) were simulated from a \(\mathcal{N}(0,1)\) normal population. Random effects \(u\) were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows. We first obtained parameters for the zero part and the count part of the model: for each group \(j\), \(j\in\{1,\dots,g\}\), the group-specific intercepts and slopes were obtained by adding the respective random effects to the fixed terms, yielding \(\beta_{0j}\), \(\beta_{1j}\) for the count part and \(\gamma_{0j}\), \(\gamma_{1j}\) for the zero part.

The probability of belonging to the zero versus the non-zero category was then determined by

\[\begin{equation*} \mu_{ij}=\text{invlogit}(\gamma_{0j}+\gamma_{1j}x_{1i}) \end{equation*}\]

We then drew random uniform numbers \(z\sim\mathcal{U}(0,1)\); if \(z_{ij}<\mu_{ij}\), the dependent variable \(y\) was set to 0, and to 1 otherwise. In the next step, a non-zero observation in \(y\) was replaced by a random draw from a zero-truncated Poisson distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{1ij})\) in the Poisson condition, and by a random draw from a zero-truncated NB distribution with the same mean and overdispersion parameter \(\theta=2\) in the NB condition.

Then, we introduced missing data in \(y\). Predictor \(x\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+x_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.4%.

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression versions of the respective function (hurdle Poisson imputation in the Poisson condition, and hurdle NB imputation in the negative binomial condition). \(x\) was included as a predictor in both the zero and the count part of the model, and the intercept and the slope of \(x\) were treated as random factors in both parts. We used the `mice` default settings and monitored convergence (for details, see Chapter 5). The \(m\) completed data sets were then analyzed by the same model that was used for data imputation, and the results were combined using Rubin’s formula (Rubin, 1987).

Results of hurdle Poisson imputation are displayed in Tables 4.12 and 4.13, results of hurdle NB imputation are displayed in Tables 4.14 and 4.15. The tables display the average parameter estimate across the 200 replications, the standard deviation of the estimates across the replications, bias, percent bias and 95% coverage.

Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 1.001 | -0.001 | -0.100 | 94.924 |
\(\beta_1\) | 0.75 | 0.746 | 0.004 | 0.533 | 93.909 |
\(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 92.893 |
\(\gamma_1\) | 0.50 | 0.492 | 0.008 | 1.600 | 96.447 |
\(\sigma_{00c}\) | 0.50 | 0.491 | 0.009 | 1.800 | NA |
\(\sigma_{11c}\) | 0.30 | 0.283 | 0.017 | 5.667 | NA |
\(\sigma_{01c}\) | 0.00 | -0.008 | 0.008 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.477 | 0.023 | 4.600 | NA |
\(\sigma_{11z}\) | 0.30 | 0.271 | 0.029 | 9.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.053 | 0.053 | Inf | NA |

Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 1.011 | -0.011 | -1.100 | 94.416 |
\(\beta_1\) | 0.75 | 0.748 | 0.002 | 0.267 | 96.447 |
\(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 91.878 |
\(\gamma_1\) | 0.50 | 0.495 | 0.005 | 1.000 | 91.878 |
\(\sigma_{00c}\) | 0.50 | 0.446 | 0.054 | 10.800 | NA |
\(\sigma_{11c}\) | 0.30 | 0.261 | 0.039 | 13.000 | NA |
\(\sigma_{01c}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.439 | 0.061 | 12.200 | NA |
\(\sigma_{11z}\) | 0.30 | 0.269 | 0.031 | 10.333 | NA |
\(\sigma_{01z}\) | 0.00 | -0.108 | 0.108 | Inf | NA |

Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 0.993 | 0.007 | 0.700 | 93.939 |
\(\beta_1\) | 0.75 | 0.735 | 0.015 | 2.000 | 92.424 |
\(\gamma_0\) | 0.00 | -0.003 | 0.003 | Inf | 93.939 |
\(\gamma_1\) | 0.50 | 0.497 | 0.003 | 0.600 | 96.970 |
\(\sigma_{00c}\) | 0.50 | 0.480 | 0.020 | 4.000 | NA |
\(\sigma_{11c}\) | 0.30 | 0.276 | 0.024 | 8.000 | NA |
\(\sigma_{01c}\) | 0.00 | -0.059 | 0.059 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.485 | 0.015 | 3.000 | NA |
\(\sigma_{11z}\) | 0.30 | 0.274 | 0.026 | 8.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.054 | 0.054 | Inf | NA |
\(\theta\) | 2.00 | 2.098 | -0.098 | -4.900 | NA |

Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 0.996 | 0.004 | 0.400 | 92.929 |
\(\beta_1\) | 0.75 | 0.743 | 0.007 | 0.933 | 93.939 |
\(\gamma_0\) | 0.00 | -0.001 | 0.001 | Inf | 93.434 |
\(\gamma_1\) | 0.50 | 0.500 | 0.000 | 0.000 | 95.455 |
\(\sigma_{00c}\) | 0.50 | 0.440 | 0.060 | 12.000 | NA |
\(\sigma_{11c}\) | 0.30 | 0.262 | 0.038 | 12.667 | NA |
\(\sigma_{01c}\) | 0.00 | -0.151 | 0.151 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.444 | 0.056 | 11.200 | NA |
\(\sigma_{11z}\) | 0.30 | 0.268 | 0.032 | 10.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
\(\theta\) | 2.00 | 1.991 | 0.009 | 0.450 | NA |

Estimates are sufficiently accurate. It appears that the bootstrap variants performed slightly worse than the Bayesian variants, especially regarding the random part of the model. This might be caused by the block bootstrap approach implemented here (we re-sample clusters rather than individual cases), which requires a substantial number of clusters. Future research needs to establish under which scenarios it is better to use the Bayesian variants of the functions and when to use the bootstrap variants; little is yet known about level-1 and level-2 sample size requirements.