The VGAM package’s vglm()
function, like the survey package’s
svymle() function, allows for maximum likelihood fitting
where linear predictors are added to one or more parameters of a
distribution — but vglm() is a lot faster and has many
distributions already built in. This is how we make
svyVGAM handle complex sampling.
I will write \(\beta\) for the regression parameters, \(\theta\) for the base parameters of the response distribution, and \(\eta\) for the linear predictors. So, for example, in a classical linear model there would be two parameters \(\theta\): the mean (\(\theta_1\)) and variance (\(\theta_2\)). The mean would have a set of regression parameters and the variance would have a single parameter. Collectively, these would be \(\beta\) (maybe \(\beta_{11}\dots\beta_{1p}\) and \(\beta_{21}\)), and the two combinations that are plugged in as \(\theta\) would be called \(\eta_1\) and \(\eta_2\). The big advantage of VGAM is that it does a lot of the work for the user: while the user can add new families, there are many pre-prepared ones, and there are built-in ways to constrain parameters to be equal or related in some other way.
To provide survey versions of vglm(), we need to (a) get
design-consistent point estimates out of vglm(), and (b)
construct design-based standard errors for the fit. The first is easy:
vglm() accepts frequency weights, which are equivalent
to sampling weights for point estimation with independent
observations.
The second can be done in two ways: resampling (which is
straightforward, if potentially slow), and linearisation. Linearisation
requires computing the influence functions of the parameters \[h_i(\beta) = -\widehat{\cal I}^{-1}_w
U_i(\beta)\] where \(\widehat{\cal
I}_w\) is the weighted estimate of the population Fisher
information, \(U_i=\partial_\beta
\ell_i(\beta)\) is the loglikelihood contribution of the \(i\)th observation, and \(w_i\) is its weight. The influence
functions have the property \[\hat\beta-\beta_0 = \sum_i w_i
h_i(\beta_0)+o_p(\|\hat\beta-\beta_0\|)\] so that the variance of
\(\hat\beta\) is asymptotically the
variance of the population total of the influence functions. The survey
package provides a function svyrecvar() to estimate
standard errors given the influence functions, or (a bit less
efficiently) you can just call svytotal().
A design object of class svyrep.design contains sets of
replicate weights analogous to jackknife or bootstrap replicates. We
need to call vglm() with each set of weights. It should be
helpful to specify the full-sample estimates as starting values.
One complication is that sets of replicate weights will typically
include some zeroes, which vglm() does not allow (eg, a
bootstrap or jackknife resample will omit some observations). We set
these to \(10^{-9}\) times the maximum
weight, which has the desired effect that the observations are present
in the fit but with (effectively) zero weight.
The cov.unscaled slot of a summary.vglm
object contains the inverse of the estimated population Fisher
information, \(\widehat{\cal
I}^{-1}_w\).
The vglm object provides \(\partial_\eta\ell_i(\eta)\) for the base
parameters \(\theta\), and also the
model matrices that specify \(\partial_\beta\eta\), so we can construct
\(U_i\). We need to take care with the
constraints, which can cause a coefficient \(\beta\) to appear in more than one linear
predictor.
Suppose \(\beta_x\) appears in both \(\eta_1\) and \(\eta_2\), with \(x\) values \(x_1\) and \(x_2\). The chain rule tells us \[\partial_{\beta_x}\ell_i =\partial_{\eta_1}\ell_i\partial_{\beta_x}\eta_1 + \partial_{\eta_2}\ell_i\partial_{\beta_x}\eta_2 = (\partial_{\eta_1}\ell_i) x_{1i}+ (\partial_{\eta_2}\ell_i) x_{2i} \] We might have \(x_1\equiv x_2\,(=x)\), but that just means \[\partial_{\beta_x}\ell_i = (\partial_{\eta_1}\ell_i) x_{i}+ (\partial_{\eta_2}\ell_i) x_{i} \]
The constraint matrix \(C\) consists of 1s and 0s. If there are \(p\) parameters in \(M\) equations the matrix is \(M\times p\), with \(C_{jk}=1\) if parameter \(k\) is in linear predictor \(j\). In the default, unconstrained setup, the constraint matrix consists of an \(M\times M\) identity matrix for each parameter, pasted columnwise to give a \(M\times pM\) matrix. In the proportional odds model, as another example, there are separate intercepts for each linear predictor but the other parameters all appear in every linear predictor. The first \(M\times M\) block is the identity, and the rest of the matrix is a column of 1s for each predictor variable. Another way to say this is that \(C_{jk}=\partial_{ (\beta_kx_k)}\eta_j\)
So, if we want \(\partial\beta\ell_i\), the chain rule says \[\begin{eqnarray*} \frac{\partial \ell_i}{\partial \beta_j} &=& \sum_k\frac{\partial \ell_i}{\partial\eta_k} \frac{\partial \eta_k}{\partial \beta_j}\\ &=& \sum_k\frac{\partial \ell_i}{\partial \eta_k} \frac{\partial \eta_k}{\partial (x\beta)_j}\frac{\partial (x\beta)_j}{\partial \beta_j}\\ &=&\sum_k \frac{\partial \ell_i}{\partial \eta_k} C_{kj}x_{ij} \end{eqnarray*}\]
There is one further complication. The model.matrix
method for vglm objects returns a model matrix of dimension
\(Mn\times p\) rather than \(n\times p\), so we need to sum over the
rows for each observation, which can be identified from the row names,
and then rescale. The right standardisation appears to come from
defining \[\tilde C_{kj}=\frac{C_{kj}}{\sum_k
C_{kj}}\] and then \[\frac{\partial
\ell_i}{\partial \beta_j}=\sum_k \frac{\partial \ell_i}{\partial
\eta_k} \tilde C_{kj}x_{ikj}.\]