Zero-inflated Poisson model for complex survey data

The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if Z is Bernoulli with probability 1 − p0 and P is Poisson with mean λ, then Y = Z + (1 − Z)P is zero-inflated Poisson. The ZIP is a latent-class model; we can have Y = 0 either because Z = 0 (‘structural’ zeroes) or because P = 0. That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on Z and a Poisson regression structure on P.

There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of svyVGAM. The pscl package (Zeileis et al) fits zero-inflated models, and so does VGAM, so we can compare the model fitted with svyVGAM to both of those and to other work-arounds.

I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200 and SXQ100: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into ‘never’ and ‘not yet’.

Here’s how I created the dataset, from two NHANES files. It’s data(nhanes_sxq) in the package

library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]

Start off by loading the packages and setting up a survey design

library(svyVGAM)
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:VGAM':
## 
##     calibrate
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)

First, we’ll fit the model just ignoring the survey design, using both pscl::zeroinfl and VGAM::vglm. These models use the same variables in a logistic regression for Z and a Poisson regression for P. In VGAM you would make the models different by constraining coefficients to be zero in one of the models; in pscl you would specify different models before and after the |.

unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
## 
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC | 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0204 -0.9433 -0.7871  0.1503 29.2567 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.6666218  0.0506660  32.894  < 2e-16 ***
## RIDAGEYR          -0.0055102  0.0008969  -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394018  0.0779480  -0.505    0.613    
## factor(RIDRETH1)3  0.6508824  0.0345734  18.826  < 2e-16 ***
## factor(RIDRETH1)4  0.6675320  0.0365963  18.240  < 2e-16 ***
## factor(RIDRETH1)5  0.5642960  0.0594928   9.485  < 2e-16 ***
## DMDEDUC            0.0094257  0.0135180   0.697    0.486    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        0.188125   0.187079   1.006  0.31461   
## RIDAGEYR          -0.002938   0.003629  -0.810  0.41822   
## factor(RIDRETH1)2 -0.079637   0.242312  -0.329  0.74242   
## factor(RIDRETH1)3  0.118370   0.116120   1.019  0.30802   
## factor(RIDRETH1)4  0.143301   0.127764   1.122  0.26203   
## factor(RIDRETH1)5  0.259516   0.223032   1.164  0.24459   
## DMDEDUC           -0.148881   0.053337  -2.791  0.00525 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -9518 on 14 Df
vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
## 
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
##     family = zipoisson(), data = nhanes_sxq, crit = "coef")
## 
## 
## Coefficients:
##       (Intercept):1       (Intercept):2          RIDAGEYR:1          RIDAGEYR:2 
##         0.188125349         1.666622759        -0.002937819        -0.005510160 
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2 
##        -0.079635992        -0.039401949         0.118369301         0.650882145 
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2 
##         0.143300364         0.667531080         0.259515415         0.564295398 
##           DMDEDUC:1           DMDEDUC:2 
##        -0.148881313         0.009425589 
## 
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556

Re-scaling the weights

A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.

nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)

wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(wt)
## 
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC | 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5864 -0.8418 -0.5430  0.1324 31.9106 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.8340666  0.0614994  29.823  < 2e-16 ***
## RIDAGEYR          -0.0073881  0.0009059  -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073243  0.0853534  -1.257   0.2086    
## factor(RIDRETH1)3  0.6551385  0.0481679  13.601  < 2e-16 ***
## factor(RIDRETH1)4  0.6358167  0.0529173  12.015  < 2e-16 ***
## factor(RIDRETH1)5  0.4774152  0.0666607   7.162 7.96e-13 ***
## DMDEDUC           -0.0237391  0.0143070  -1.659   0.0971 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.660511   0.214458   3.080 0.002071 ** 
## RIDAGEYR          -0.007833   0.003673  -2.133 0.032957 *  
## factor(RIDRETH1)2 -0.116798   0.252451  -0.463 0.643610    
## factor(RIDRETH1)3  0.101968   0.151531   0.673 0.500999    
## factor(RIDRETH1)4 -0.160809   0.181429  -0.886 0.375430    
## factor(RIDRETH1)5  0.106776   0.230339   0.464 0.642964    
## DMDEDUC           -0.202397   0.057624  -3.512 0.000444 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 20 
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
##     family = zipoisson(), data = nhanes_sxq, weights = scaledwt, 
##     crit = "coef")
## 
## Coefficients: 
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1        0.6605042  0.2144354   3.080 0.002069 ** 
## (Intercept):2        1.8340681  0.0614568  29.843  < 2e-16 ***
## RIDAGEYR:1          -0.0078333  0.0036726  -2.133 0.032934 *  
## RIDAGEYR:2          -0.0073881  0.0008995  -8.214  < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894  0.2527278  -0.462 0.643999    
## factor(RIDRETH1)2:2 -0.1073312  0.0847641  -1.266 0.205430    
## factor(RIDRETH1)3:1  0.1019712  0.1515002   0.673 0.500899    
## factor(RIDRETH1)3:2  0.6551367  0.0481359  13.610  < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040  0.1814098  -0.886 0.375395    
## factor(RIDRETH1)4:2  0.6358147  0.0529738  12.002  < 2e-16 ***
## factor(RIDRETH1)5:1  0.1067789  0.2303235   0.464 0.642931    
## factor(RIDRETH1)5:2  0.4774124  0.0663590   7.194 6.27e-13 ***
## DMDEDUC:1           -0.2023967  0.0576221  -3.512 0.000444 ***
## DMDEDUC:2           -0.0237389  0.0146964  -1.615 0.106249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
## 
## Log-likelihood: -9765.52 on 5036 degrees of freedom
## 
## Number of Fisher scoring iterations: 8 
## 
## No Hauck-Donner effect found in any of the estimates
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote( 
    coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
    ))
rep1
##                              theta     SE
## count_(Intercept)        1.8340684 0.1350
## count_RIDAGEYR          -0.0073881 0.0028
## count_factor(RIDRETH1)2 -0.1073314 0.2471
## count_factor(RIDRETH1)3  0.6551358 0.1857
## count_factor(RIDRETH1)4  0.6358143 0.1438
## count_factor(RIDRETH1)5  0.4774122 0.2501
## count_DMDEDUC           -0.0237386 0.0797
## zero_(Intercept)         0.6605043 0.2582
## zero_RIDAGEYR           -0.0078333 0.0039
## zero_factor(RIDRETH1)2  -0.1167895 0.2854
## zero_factor(RIDRETH1)3   0.1019712 0.0983
## zero_factor(RIDRETH1)4  -0.1608040 0.0859
## zero_factor(RIDRETH1)5   0.1067790 0.2120
## zero_DMDEDUC            -0.2023967 0.0586
repv = withReplicates(repdes, quote( 
    coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
    ))
repv
##                          theta     SE
## (Intercept):1        0.6605042 0.2582
## (Intercept):2        1.8340681 0.1350
## RIDAGEYR:1          -0.0078333 0.0039
## RIDAGEYR:2          -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1  0.1019712 0.0983
## factor(RIDRETH1)3:2  0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2  0.6358147 0.1438
## factor(RIDRETH1)5:1  0.1067789 0.2120
## factor(RIDRETH1)5:2  0.4774124 0.2501
## DMDEDUC:1           -0.2023967 0.0586
## DMDEDUC:2           -0.0237389 0.0797

svymle

Another way to fit the model using just the survey package is with svymle. This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments. That is, we specify the log-likelihood in terms of the Bernoulli parameter p0 and the Poisson mean λ – actually logitp0 and η = log λ, and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model. In fact, svymle is very similar in underlying approach to vglm; the difference is that vglm comes with a large collection of predefined models.

In defining the loglikelihood I’m going to take advantage of the Poisson pmf being available in R. Let’s call it ϝ(y, λ). The loglikelihood is ℓ(y; μ, p0) = log (p0{y =  = 0} + (1 − p)ϝ(y, μ)) only we want it in terms of logitp0 and η = log λ

loglike = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    log(p*(y==0)+(1-p)*dpois(y,mu))
}

We also need the derivatives with respect to logitp0 and η = log λ

dlogitp = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dexpit = p/(1+p)^2
    num = dexpit*(y==0)-dexpit*dpois(y,mu)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   
    
deta = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dmutoy = 0*y
    dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
    num = (1-p)*(-dpois(y,mu)+dmutoy)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   

score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))

And now we call svymle giving the linear predictors for both parameters. One of the formulas needs to include the response variable Y.

nlmfit = svymle(loglike=loglike, grad=score, design=des, 
        formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
        logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
      start=coef(unwt), na.action="na.omit",method="BFGS")

summary(nlmfit)
## Survey-sampled mle: 
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~ 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR + 
##     factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit", 
##     method = "BFGS")
##                                  Coef          SE p.value
## eta.(Intercept)           1.826824375 0.154214263 < 0.001
## eta.RIDAGEYR             -0.007800674 0.003014996 0.00967
## eta.factor(RIDRETH1)2    -0.119695528 0.235192608 0.61080
## eta.factor(RIDRETH1)3     0.639831707 0.165176844 < 0.001
## eta.factor(RIDRETH1)4     0.615167757 0.117750557 < 0.001
## eta.factor(RIDRETH1)5     0.465556579 0.213462435 0.02919
## eta.DMDEDUC              -0.008130626 0.072679434 0.91093
## logitp.(Intercept)        0.578310456 0.246782589 0.01911
## logitp.RIDAGEYR          -0.006077530 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033442030 0.280701222 0.90517
## logitp.factor(RIDRETH1)3  0.124434633 0.095140201 0.19090
## logitp.factor(RIDRETH1)4 -0.151763506 0.086322700 0.07873
## logitp.factor(RIDRETH1)5  0.119530402 0.209380277 0.56808
## logitp.DMDEDUC           -0.209112732 0.053553190 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)

svyVGAM

Finally, we use svy_vglm, with variances by linearisation

library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)
## 
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights, 
##     crit = "coef")
## 
## 
## Coefficients:
##       (Intercept):1       (Intercept):2          RIDAGEYR:1          RIDAGEYR:2 
##         0.660504243         1.834068104        -0.007833317        -0.007388071 
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2 
##        -0.116789371        -0.107331190         0.101971159         0.655136725 
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2 
##        -0.160804047         0.635814748         0.106778915         0.477412443 
##           DMDEDUC:1           DMDEDUC:2 
##        -0.202396659        -0.023738881 
## 
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966

and by replicate weights

svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")