Chapter 1 Overview

Count data require special analysis techniques (for a comprehensive overview, see Hilbe, 2011): Ordinary count data can be analyzed with a classical Poisson regression model. Overdispersed count data, i.e., count data whose variance is larger than their mean, are usually modeled by either quasi-Poisson or negative binomial (NB) models (for the distinction between true and apparent overdispersion, see Hilbe, 2011). Zero-inflated count data, i.e., count data that contain more zero counts than would be expected under the Poisson, quasi-Poisson, or NB model, can be analyzed by either zero-inflated Poisson or NB models, or by hurdle Poisson or NB models (see Zeileis, Kleiber, & Jackman, 2008, for an overview). Furthermore, generalized linear mixed effects versions of these models can be fitted to account for a clustered structure of the data.
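As a rough illustration of the two definitions above, the following Python sketch computes a variance-to-mean ratio and compares the observed share of zeros with the share a Poisson model with the same mean would expect. The function names and the toy counts are hypothetical and not part of any package discussed here. Note that this is only a marginal check that ignores predictors, so a large ratio may also reflect apparent rather than true overdispersion.

```python
import math
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 under a Poisson model)."""
    return variance(counts) / mean(counts)


def zero_excess(counts):
    """Observed share of zeros minus the Poisson-expected share exp(-mean)."""
    observed = sum(1 for c in counts if c == 0) / len(counts)
    expected = math.exp(-mean(counts))
    return observed - expected


# Hypothetical toy counts, made up for illustration only.
counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 13]

print("variance / mean:", round(dispersion_index(counts), 2))
print("excess zero share:", round(zero_excess(counts), 2))
```

A ratio clearly above 1 points towards quasi-Poisson or NB models, and a clearly positive excess zero share points towards zero-inflated or hurdle models. Neither number replaces a proper model comparison.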

Missing data methods, and especially multiple imputation procedures, for count data are, however, still very scarce. In practice, the missing data problem in count variables is often handled either (a) by ignoring that the data are counts and proceeding as if they were continuous, or (b) by treating the data as (ordered) categorical. We regard these solutions as quick fixes that may work in some settings, but not as an optimal and general way to analyze incomplete count data and to obtain precise and unbiased parameter estimates and standard errors in a wide variety of scenarios. See, for example, Yu, Burton, & Rivero-Arias (2007) or Kleinke (2017) for situations in which imputation models based on the normal model fail. Furthermore, see Chapter 3 for an evaluation of the strategy of treating count data as if they were ordered categorical.

Based on research by Yu et al. (2007) and Kleinke (2017), we recommend using imputation methods whose parametric assumptions fit the data at hand.

Currently, the following packages (apart from countimp) can handle missing data based on various count models:

  • IVEware, which is available as a SAS add-on or as a standalone version, can create imputations based on an ordinary Poisson model.
  • R package mi supports flat-file count data imputation. Zero-inflation models, hurdle models, and multilevel count models are not supported.
  • ice for Stata supports count data imputation under a Poisson or NB model.

This document is structured as follows:

  • Chapter 2 gives information about missing data and multiple imputation in general, and familiarizes readers with typical count data models.
  • In Chapter 3, we illustrate by means of Monte Carlo simulation why it may not be a good idea to treat counts as if they were either continuous or (ordered) categorical.
  • In Chapter 4, we describe package countimp in detail and provide practical examples as well as Monte Carlo simulations to assess the quality of both the newly introduced and the refurbished algorithms from Version 2 of package countimp.

An evaluation of the algorithms from Version 1 of package countimp may furthermore be found in:

  • Kleinke & Reinecke (2013)
  • Kleinke & Reinecke (2015a)
  • Kleinke & Reinecke (2015b)

What is new in version 2?

  • We have included support for both hurdle and zero-inflation models for both flat-file and multilevel data sets (only two-level models are currently supported).
  • We have replaced package glmmADMB with glmmTMB to fit the two-level NB, zero-inflation, and hurdle models; in our experience, glmmTMB runs more stably.
  • We have adapted our functions to the new mice architecture (countimp version 1 only works with mice version 2.35 or older).
  • We have included automatic handling of outliers in the imputed values (argument EV=TRUE). Note that extreme imputations might indicate problems with the imputation method, the model, and/or the algorithm. If extreme imputations are detected, they are replaced by imputations obtained from mice.impute.midastouch(), a predictive mean matching (PMM) variant described in Gaffert, Meinfelder, & Bosch (2016). PMM is usually a good all-round imputation method that works in many scenarios (van Buuren, 2012).
  • We have included a two-level predictive mean matching algorithm. PMM belongs to the hot deck or k-nearest neighbour procedures and imputes an actual observed value. Based on predictions from a normal two-level linear mixed effects model, an observed case (donor) is matched to an incomplete case, and the donor's observed value replaces the missing one. By imputing only observed values, PMM is able (up to a certain degree) to preserve the original distribution of the variable at hand and can buffer mild to moderate violations of model assumptions. On the downside, PMM cannot extrapolate beyond the observed range of the remaining cases and might not be a good method in situations where the distribution of the observed cases and the assumed distribution of the unobserved cases differ substantially. A description of the method, as well as results of a first Monte Carlo simulation, may be obtained from https://www.kkleinke.com/files/slides/2016-09-dgps.pdf.
  • All imputation functions are available as Bayesian regression and bootstrap regression variants (see Chapter 2.1.4). These are two different methods of introducing between-imputation variability. The flat-file bootstrap functions re-sample individual observations, whereas the two-level functions re-sample clusters. Note that re-sampling clusters requires a substantial level-2 sample size. If this is not the case, it is advisable to use the Bayesian variant instead. For further information, see the Monte Carlo simulation in Chapter 4.10.
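The donor-matching idea behind predictive mean matching, as described in the list above, can be sketched in a few lines of Python. This is a deliberately simplified flat-file illustration and not the countimp or mice implementation: it uses ordinary least squares with a single predictor instead of a two-level linear mixed effects model, and it omits the Bayesian or bootstrap step that introduces between-imputation variability. The function names, the number of donors k, and the toy data are assumptions made for this example.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x, y, k=3, rng=None):
    """Replace None entries of y with observed values of the k closest donors."""
    if rng is None:
        rng = random.Random()
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in observed], [yi for _, yi in observed])
    # Each donor is stored as (predicted mean, actually observed value).
    donors = [(a + b * xi, yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)  # observed values stay untouched
            continue
        target = a + b * xi  # predicted mean of the incomplete case
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])  # impute a real observed value
    return completed


# Hypothetical toy data: x is fully observed, y is a count with two gaps.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 1, 1, None, 4, 3, None, 9]
print(pmm_impute(x, y, k=3, rng=random.Random(1)))
```

Because every imputed value is copied from a donor, the completed variable contains only values that were actually observed. This illustrates both the strength (plausible, integer-valued imputations) and the limitation (no extrapolation beyond the observed range) noted above.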