---
title: "VGAMs for survey data: theory"
author: "Thomas Lumley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{VGAMs for survey data: theory}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The **VGAM** package's `vglm()` function, like the **survey** package's `svymle()` function, allows maximum likelihood fitting where linear predictors are attached to one or more parameters of a distribution --- but `vglm()` is much faster and has many distributions already built in. This is how we make **svyVGAM** handle complex sampling.

I will write $\beta$ for the regression parameters, $\theta$ for the base parameters of the response distribution, and $\eta$ for the linear predictors. For example, in a classical linear model there are two parameters $\theta$: the mean ($\theta_1$) and the variance ($\theta_2$). The mean has a set of regression parameters and the variance has a single parameter. Collectively these are $\beta$ (say $\beta_{11},\dots,\beta_{1p}$ and $\beta_{21}$), and the two linear combinations that are plugged in as $\theta$ are called $\eta_1$ and $\eta_2$.

The big advantage of **VGAM** is that it does a lot of the work for the user: although the user can add new families, many come pre-prepared, and there are built-in ways to constrain parameters to be equal or related in some other way.

To provide survey versions of `vglm()`, we need to (a) get design-consistent point estimates out of `vglm()`, and (b) construct design-based standard errors for the fit. The first is easy: `vglm()` accepts frequency weights, which are [equivalent to sampling weights for point estimation](https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/) with independent observations.

The second can be done in two ways: resampling (which is straightforward, if potentially slow) and linearisation. Linearisation requires computing the influence functions of the parameters,
$$h_i(\beta) = -\widehat{\cal I}^{-1}_w U_i(\beta),$$
where $\widehat{\cal I}_w$ is the weighted estimate of the population Fisher information, $U_i=\partial_\beta \ell_i(\beta)$ is the score contribution (the derivative of the loglikelihood contribution) of the $i$th observation, and $w_i$ is its sampling weight. The influence functions have the property
$$\hat\beta-\beta_0 = \sum_i w_i h_i(\beta_0)+o_p(\|\hat\beta-\beta_0\|),$$
so the variance of $\hat\beta$ is asymptotically the variance of the estimated population total of the influence functions. The **survey** package provides the function `svyrecvar()` to estimate standard errors given the influence functions, or (a bit less efficiently) you can just call `svytotal()`.

### Resampling

A design object of class `svyrep.design` contains sets of replicate weights analogous to jackknife or bootstrap replicates. We need to call `vglm()` with each set of weights. It helps to specify the full-sample estimates as starting values.

One complication is that sets of replicate weights will typically include some zeroes, which `vglm()` does not allow (e.g., a bootstrap or jackknife resample will omit some observations). We set these to $10^{-9}$ times the maximum weight, which has the desired effect that the observations are present in the fit but with (effectively) zero weight.
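To make this concrete, here is a minimal sketch of that refitting loop. The data set, model, and bootstrap-replicate design below are invented purely for illustration (a Poisson model with a single linear predictor); none of the object names come from **svyVGAM**, which handles all of this for the user.

```{r, eval=FALSE}
## Illustrative only: toy data, a bootstrap-replicate design, and the
## replicate refitting loop described above.
library(VGAM)
library(survey)

set.seed(1)
df <- data.frame(x = rnorm(100), pw = rexp(100) + 1)
df$y <- rpois(100, exp(0.2 + 0.5 * df$x))

des  <- svydesign(ids = ~1, weights = ~pw, data = df)
rdes <- as.svrepdesign(des, type = "bootstrap", replicates = 50)

full <- vglm(y ~ x, poissonff, data = df, weights = pw)

repwts <- weights(rdes, type = "analysis")      # one column per replicate
betas <- apply(repwts, 2, function(wi) {
  wi[wi == 0] <- 1e-9 * max(wi)                 # vglm() does not accept zero weights
  coef(vglm(y ~ x, poissonff, data = df, weights = wi,
            coefstart = coef(full)))            # start from the full-sample estimates
})

## combine the replicate estimates using the design's scaling constants
v <- svrVar(t(betas), rdes$scale, rdes$rscales, coef = coef(full))
```

The replicate coefficient estimates are combined with `svrVar()` in the usual way for the chosen replication scheme; this is exactly what a replicate-weights analysis of any other statistic would do.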
### Linearisation

The `cov.unscaled` slot of a `summary.vglm` object contains the inverse of the estimated population Fisher information, $\widehat{\cal I}^{-1}_w$.

The `vglm` object provides $\partial_\eta\ell_i(\eta)$ for the base parameters $\theta$, and also the model matrices that specify $\partial_\beta\eta$, so we can construct $U_i$. We need to take care with the constraints, which can cause a coefficient $\beta$ to appear in more than one linear predictor.

Suppose $\beta_x$ appears in both $\eta_1$ and $\eta_2$, with $x$ values $x_1$ and $x_2$. The chain rule tells us
$$\partial_{\beta_x}\ell_i =\partial_{\eta_1}\ell_i\,\partial_{\beta_x}\eta_1 + \partial_{\eta_2}\ell_i\,\partial_{\beta_x}\eta_2 = (\partial_{\eta_1}\ell_i) x_{1i}+ (\partial_{\eta_2}\ell_i) x_{2i}.$$
We might have $x_1\equiv x_2\,(=x)$, in which case this just reduces to
$$\partial_{\beta_x}\ell_i = (\partial_{\eta_1}\ell_i) x_{i}+ (\partial_{\eta_2}\ell_i) x_{i}.$$

The constraint matrix $C$ consists of 1s and 0s. If there are $p$ parameters and $M$ linear predictors, the matrix is $M\times p$, with $C_{jk}=1$ if parameter $k$ appears in linear predictor $j$. In the default, unconstrained setup the constraint matrix consists of an $M\times M$ identity matrix for each parameter, pasted columnwise to give an $M\times pM$ matrix. In the proportional odds model, as another example, there is a separate intercept for each linear predictor but every other parameter appears in every linear predictor: the first $M\times M$ block is the identity, and the rest of the matrix is a column of 1s for each predictor variable. For example, with $M=2$ and a single predictor $x$,
$$C=\begin{pmatrix}1&0&1\\0&1&1\end{pmatrix},$$
where the first two columns correspond to the two intercepts and the last column to the shared coefficient of $x$. Another way to say this is that $C_{jk}=\partial_{(\beta_k x_k)}\eta_j$.

So, if we want $\partial_\beta\ell_i$, the chain rule says
\begin{eqnarray*}
\frac{\partial \ell_i}{\partial \beta_j} &=& \sum_k\frac{\partial \ell_i}{\partial\eta_k} \frac{\partial \eta_k}{\partial \beta_j}\\
&=& \sum_k\frac{\partial \ell_i}{\partial \eta_k} \frac{\partial \eta_k}{\partial (x\beta)_j}\frac{\partial (x\beta)_j}{\partial \beta_j}\\
&=&\sum_k \frac{\partial \ell_i}{\partial \eta_k} C_{kj}x_{ij}
\end{eqnarray*}

There is one further complication. The `model.matrix` method for `vglm` objects returns a model matrix of dimension $Mn\times p$ rather than $n\times p$, so we need to sum over the rows belonging to each observation (which can be identified from the row names) and then rescale. The right standardisation appears to come from defining
$$\tilde C_{kj}=\frac{C_{kj}}{\sum_{k'} C_{k'j}}$$
and then
$$\frac{\partial \ell_i}{\partial \beta_j}=\sum_k \frac{\partial \ell_i}{\partial \eta_k} \tilde C_{kj}x_{ikj}.$$
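As a concrete illustration of these pieces, here is a sketch for the simplest possible case, a single linear predictor ($M=1$), reusing the toy Poisson data, design, and fit from the resampling sketch above. With a log link the score with respect to $\eta$ is just $y_i-\mu_i$ and the constraint bookkeeping is trivial; this is only the algebra above written out, not the internal code of **svyVGAM**.

```{r, eval=FALSE}
## Continue with `df`, `des`, and `full` from the resampling sketch above.
## For poissonff with a log link and a single linear predictor (M = 1),
## the per-observation score with respect to eta is simply (y - mu).
mu <- as.vector(fitted(full))
X  <- model.matrix(full, type = "lm")   # n x p here, since M = 1
U  <- (df$y - mu) * X                   # rows are U_i = dl_i / dbeta

## influence functions h_i = Iinv %*% U_i; the sign does not affect the variance
Iinv <- summary(full)@cov.unscaled
infl <- U %*% Iinv

## design-based variance: variance of the estimated population total of the h_i
V  <- vcov(svytotal(infl, des))
SE <- sqrt(diag(V))
```

For a family with $M>1$ and shared coefficients, $X$ would instead be the $Mn\times p$ `vlm`-type model matrix, and the scores with respect to the $\eta_k$ would be combined through $\tilde C$ exactly as in the last formula above.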