The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if Z is Bernoulli with probability 1 − p0 and P is Poisson with mean λ, then Y = Z + (1 − Z)P is zero-inflated Poisson. The ZIP is a latent-class model; we can have Y = 0 either because Z = 0 (‘structural’ zeroes) or because P = 0. That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on Z and a Poisson regression structure on P.
There isn’t (as far as I know) existing software in R for
design-based inference in zero-inflated Poisson models, so it’s a good
example for the benefits of svyVGAM
. The pscl
package (Zeileis et al) fits zero-inflated models, and so does
VGAM
, so we can compare the model fitted with
svyVGAM
to both of those and to other work-arounds.
I’ll do an example with data on number of sexual partners, from
NHANES 2003-2004. We will look at questions SXQ200
and
SXQ100
: number of male sexual partners. Combining these
gives a ‘real’ zero-inflated variable: based on sexual orientation the
zeroes would divide into ‘never’ and ‘not yet’.
Here’s how I created the dataset, from two NHANES files. It’s
data(nhanes_sxq)
in the package
library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]
Start off by loading the packages and setting up a survey design
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:VGAM':
##
## calibrate
## The following object is masked from 'package:graphics':
##
## dotchart
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)
First, we’ll fit the model just ignoring the survey design, using
both pscl::zeroinfl
and VGAM::vglm
. These
models use the same variables in a logistic regression for Z and a Poisson regression for P. In VGAM
you would
make the models different by constraining coefficients to be zero in one
of the models; in pscl
you would specify different models
before and after the |
.
unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0204 -0.9433 -0.7871 0.1503 29.2567
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6666218 0.0506660 32.894 < 2e-16 ***
## RIDAGEYR -0.0055102 0.0008969 -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394018 0.0779480 -0.505 0.613
## factor(RIDRETH1)3 0.6508824 0.0345734 18.826 < 2e-16 ***
## factor(RIDRETH1)4 0.6675320 0.0365963 18.240 < 2e-16 ***
## factor(RIDRETH1)5 0.5642960 0.0594928 9.485 < 2e-16 ***
## DMDEDUC 0.0094257 0.0135180 0.697 0.486
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.188125 0.187079 1.006 0.31461
## RIDAGEYR -0.002938 0.003629 -0.810 0.41822
## factor(RIDRETH1)2 -0.079637 0.242312 -0.329 0.74242
## factor(RIDRETH1)3 0.118370 0.116120 1.019 0.30802
## factor(RIDRETH1)4 0.143301 0.127764 1.122 0.26203
## factor(RIDRETH1)5 0.259516 0.223032 1.164 0.24459
## DMDEDUC -0.148881 0.053337 -2.791 0.00525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -9518 on 14 Df
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.188125349 1.666622759 -0.002937819 -0.005510160
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.079635992 -0.039401949 0.118369301 0.650882145
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## 0.143300364 0.667531080 0.259515415 0.564295398
## DMDEDUC:1 DMDEDUC:2
## -0.148881313 0.009425589
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556
A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.
nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)
wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5864 -0.8418 -0.5430 0.1324 31.9106
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8340666 0.0614994 29.823 < 2e-16 ***
## RIDAGEYR -0.0073881 0.0009059 -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073243 0.0853534 -1.257 0.2086
## factor(RIDRETH1)3 0.6551385 0.0481679 13.601 < 2e-16 ***
## factor(RIDRETH1)4 0.6358167 0.0529173 12.015 < 2e-16 ***
## factor(RIDRETH1)5 0.4774152 0.0666607 7.162 7.96e-13 ***
## DMDEDUC -0.0237391 0.0143070 -1.659 0.0971 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.660511 0.214458 3.080 0.002071 **
## RIDAGEYR -0.007833 0.003673 -2.133 0.032957 *
## factor(RIDRETH1)2 -0.116798 0.252451 -0.463 0.643610
## factor(RIDRETH1)3 0.101968 0.151531 0.673 0.500999
## factor(RIDRETH1)4 -0.160809 0.181429 -0.886 0.375430
## factor(RIDRETH1)5 0.106776 0.230339 0.464 0.642964
## DMDEDUC -0.202397 0.057624 -3.512 0.000444 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, weights = scaledwt,
## crit = "coef")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.6605042 0.2144354 3.080 0.002069 **
## (Intercept):2 1.8340681 0.0614568 29.843 < 2e-16 ***
## RIDAGEYR:1 -0.0078333 0.0036726 -2.133 0.032934 *
## RIDAGEYR:2 -0.0073881 0.0008995 -8.214 < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894 0.2527278 -0.462 0.643999
## factor(RIDRETH1)2:2 -0.1073312 0.0847641 -1.266 0.205430
## factor(RIDRETH1)3:1 0.1019712 0.1515002 0.673 0.500899
## factor(RIDRETH1)3:2 0.6551367 0.0481359 13.610 < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040 0.1814098 -0.886 0.375395
## factor(RIDRETH1)4:2 0.6358147 0.0529738 12.002 < 2e-16 ***
## factor(RIDRETH1)5:1 0.1067789 0.2303235 0.464 0.642931
## factor(RIDRETH1)5:2 0.4774124 0.0663590 7.194 6.27e-13 ***
## DMDEDUC:1 -0.2023967 0.0576221 -3.512 0.000444 ***
## DMDEDUC:2 -0.0237389 0.0146964 -1.615 0.106249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
##
## Log-likelihood: -9765.52 on 5036 degrees of freedom
##
## Number of Fisher scoring iterations: 8
##
## No Hauck-Donner effect found in any of the estimates
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote(
coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
))
rep1
## theta SE
## count_(Intercept) 1.8340684 0.1350
## count_RIDAGEYR -0.0073881 0.0028
## count_factor(RIDRETH1)2 -0.1073314 0.2471
## count_factor(RIDRETH1)3 0.6551358 0.1857
## count_factor(RIDRETH1)4 0.6358143 0.1438
## count_factor(RIDRETH1)5 0.4774122 0.2501
## count_DMDEDUC -0.0237386 0.0797
## zero_(Intercept) 0.6605043 0.2582
## zero_RIDAGEYR -0.0078333 0.0039
## zero_factor(RIDRETH1)2 -0.1167895 0.2854
## zero_factor(RIDRETH1)3 0.1019712 0.0983
## zero_factor(RIDRETH1)4 -0.1608040 0.0859
## zero_factor(RIDRETH1)5 0.1067790 0.2120
## zero_DMDEDUC -0.2023967 0.0586
repv = withReplicates(repdes, quote(
coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
))
repv
## theta SE
## (Intercept):1 0.6605042 0.2582
## (Intercept):2 1.8340681 0.1350
## RIDAGEYR:1 -0.0078333 0.0039
## RIDAGEYR:2 -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1 0.1019712 0.0983
## factor(RIDRETH1)3:2 0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2 0.6358147 0.1438
## factor(RIDRETH1)5:1 0.1067789 0.2120
## factor(RIDRETH1)5:2 0.4774124 0.2501
## DMDEDUC:1 -0.2023967 0.0586
## DMDEDUC:2 -0.0237389 0.0797
Another way to fit the model using just the survey
package is with svymle
. This takes the log-likelihood and
its derivative as arguments, and adds linear predictors to some or all
of those arguments. That is, we specify the log-likelihood in terms of
the Bernoulli parameter p0 and the Poisson mean
λ – actually logitp0 and η = log λ, and also give
the derivative with respect to these two parameters. The software does
the necessary additional work to put linear predictors on the parameters
and give us the zero-inflated model. In fact, svymle
is
very similar in underlying approach to vglm
; the difference
is that vglm
comes with a large collection of predefined
models.
In defining the loglikelihood I’m going to take advantage of the Poisson pmf being available in R. Let’s call it ϝ(y, λ). The loglikelihood is ℓ(y; μ, p0) = log (p0{y = = 0} + (1 − p)ϝ(y, μ)) only we want it in terms of logitp0 and η = log λ
loglike = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
log(p*(y==0)+(1-p)*dpois(y,mu))
}
We also need the derivatives with respect to logitp0 and η = log λ
dlogitp = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dexpit = p/(1+p)^2
num = dexpit*(y==0)-dexpit*dpois(y,mu)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
deta = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dmutoy = 0*y
dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
num = (1-p)*(-dpois(y,mu)+dmutoy)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))
And now we call svymle
giving the linear predictors for
both parameters. One of the formulas needs to include the response
variable Y.
nlmfit = svymle(loglike=loglike, grad=score, design=des,
formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
start=coef(unwt), na.action="na.omit",method="BFGS")
summary(nlmfit)
## Survey-sampled mle:
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR +
## factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit",
## method = "BFGS")
## Coef SE p.value
## eta.(Intercept) 1.826824375 0.154214263 < 0.001
## eta.RIDAGEYR -0.007800674 0.003014996 0.00967
## eta.factor(RIDRETH1)2 -0.119695528 0.235192608 0.61080
## eta.factor(RIDRETH1)3 0.639831707 0.165176844 < 0.001
## eta.factor(RIDRETH1)4 0.615167757 0.117750557 < 0.001
## eta.factor(RIDRETH1)5 0.465556579 0.213462435 0.02919
## eta.DMDEDUC -0.008130626 0.072679434 0.91093
## logitp.(Intercept) 0.578310456 0.246782589 0.01911
## logitp.RIDAGEYR -0.006077530 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033442030 0.280701222 0.90517
## logitp.factor(RIDRETH1)3 0.124434633 0.095140201 0.19090
## logitp.factor(RIDRETH1)4 -0.151763506 0.086322700 0.07873
## logitp.factor(RIDRETH1)5 0.119530402 0.209380277 0.56808
## logitp.DMDEDUC -0.209112732 0.053553190 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
Finally, we use svy_vglm
, with variances by
linearisation
library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
##
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights,
## crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.660504243 1.834068104 -0.007833317 -0.007388071
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.116789371 -0.107331190 0.101971159 0.655136725
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## -0.160804047 0.635814748 0.106778915 0.477412443
## DMDEDUC:1 DMDEDUC:2
## -0.202396659 -0.023738881
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966
and by replicate weights
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")