---
title: "Zero-inflated Poisson model for complex survey data"
author: "Thomas Lumley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Zero-inflated Poisson model for complex survey data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if $Z$ is Bernoulli with probability $1-p_0$ and $P$ is Poisson with mean $\lambda$, then $Y=Z+(1-Z)P$ is zero-inflated Poisson. The ZIP is a latent-class model; we can have $Y=0$ either because $Z=0$ ('structural' zeroes) or because $P=0$. That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any.  To turn this into a regression model we typically put a logistic regression structure on $Z$ and a Poisson regression structure on $P$. 

There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of `svyVGAM`.  The `pscl` package (Zeileis et al) fits zero-inflated models, and so does `VGAM`, so we can compare the model fitted with `svyVGAM` to both of those and to other work-arounds.

I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions `SXQ200` and `SXQ100`: number of male sexual partners.  Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into 'never' and 'not yet'.

Here's how I created the dataset, from two NHANES files.  It's `data(nhanes_sxq)` in the package
```
library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]
```

Start off by loading the packages and setting up a survey design

```{r}
library(svyVGAM)
library(pscl)
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)
```

First, we'll fit the model just ignoring the survey design, using both `pscl::zeroinfl` and `VGAM::vglm`.  These models use the same variables in a logistic regression for $Z$ and a Poisson regression for $P$.  In `VGAM` you would make the models different by constraining coefficients to be zero in one of the models; in `pscl` you would specify different models before and after the `|`.

```{r}
unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)

vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
```

### Re-scaling the weights

A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights. 


```{r}
nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)

wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
summary(wt)

wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
```


```{r warning=FALSE}
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote( 
    coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
    ))
rep1

repv = withReplicates(repdes, quote( 
    coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
    ))
repv
```

### svymle

Another way to fit the model using just the `survey` package is with `svymle`.  This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments.  That is, we specify the log-likelihood in terms of the Bernoulli parameter $p_0$ and the Poisson mean $\lambda$ -- actually $\mathrm{logit} p_0$ and $\eta=\log\lambda$, and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model.  In fact, `svymle` is very similar in underlying approach to `vglm`; the difference is that `vglm` comes with a large collection of predefined models. 

In defining the loglikelihood I'm going to take advantage of the Poisson pmf being available in R. Let's call it $\digamma(y,\lambda)$. The loglikelihood is 
$$\ell(y; \mu,p_0)=\log\left(p_0\{y==0\}+(1-p)\digamma(y,\mu)\right)$$
only we want it in terms of $\mathrm{logit} p_0$ and $\eta=\log \lambda$
```{r}
loglike = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    log(p*(y==0)+(1-p)*dpois(y,mu))
}
```
   
We also need the derivatives with respect to $\mathrm{logit} p_0$ and $\eta=\log \lambda$
   
```{r}    
dlogitp = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dexpit = p/(1+p)^2
    num = dexpit*(y==0)-dexpit*dpois(y,mu)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   
    
deta = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dmutoy = 0*y
    dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
    num = (1-p)*(-dpois(y,mu)+dmutoy)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   

score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))
```

And now we call `svymle` giving the linear predictors for both parameters.  One of the formulas needs to include the response variable $Y$. 

```{r}
nlmfit = svymle(loglike=loglike, grad=score, design=des, 
        formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
        logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
      start=coef(unwt), na.action="na.omit",method="BFGS")

summary(nlmfit)
```

### svyVGAM

Finally, we use `svy_vglm`, with variances by linearisation

```{r}
library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
```

and by replicate weights

```
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")
```