Choosing variances in generalized linear models

Today I'm going to talk about a particular problem from my own research and will outline a method for choosing variances in generalized linear models (GLMs), but I am also asking a question.


 The standard setup of GLMs is (roughly) the following.  One hypothesizes that the conditional mean of the outcome variable (y), E[y|x], can be expressed as a function of a linear predictor x'b, or: 
 E[y|x]=μ(x'b).  

 Strictly speaking, μ is the inverse of the link function g, which satisfies g(E[y|x])=x'b, but I'll follow the common shorthand and say, for example, "log link" when μ(·)=exp(·).  Common choices for μ include the identity and the exponential (log link).  One common question is why one would choose to use a GLM with, say, a log link instead of estimating via OLS the regression model: 
 ln(y)=x'b+e.  

 There are two principal objections to the OLS method.  First, in the presence of heteroskedasticity it is difficult (though possible, e.g. via smearing-type retransformations) to convert predicted values of ln(y) into predicted values of y.  Second, the OLS method throws out any data coming from observations with y=0. 
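The retransformation problem can be seen in a small simulation (the data-generating process and all numbers here are invented for illustration): with log-scale heteroskedasticity, OLS on ln(y) recovers the coefficients of the log model, but exponentiating the fitted values does not recover E[y|x], because the omitted factor E[exp(e)|x] varies with x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 1, n)
# log-scale error whose standard deviation grows with x (heteroskedasticity)
sigma = 0.5 + 1.0 * x
e = rng.normal(0, sigma)
y = np.exp(1.0 + 2.0 * x + e)

# OLS of ln(y) on x recovers the log-model coefficients (E[e|x] = 0)...
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

# ...but exp(x'b) is NOT E[y|x]: the retransformation needs E[exp(e)|x],
# which here is exp(sigma(x)^2 / 2) and varies with x.
naive = np.exp(X @ b)
truth = np.exp(1.0 + 2.0 * x + sigma**2 / 2)
print(naive / truth)  # ratios well below 1, and varying with x
```

A GLM with a log link models E[y|x]=exp(x'b) directly, so no retransformation is needed and y=0 observations pose no problem.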

 Unfortunately, choosing a link function is comparatively easy (in my view) compared with the next step of choosing an appropriate function for the variance of y given x, which must be prespecified in most GLMs*.  In my work I have focused on choosing variance functions that are proportional to some power of the mean: 
 v(y|x)=α*(μ(x'b))^k.  

 The trick, then, is to choose the correct power, with various powers of the mean corresponding to Poisson (k=1), gamma (k=2), and inverse Gaussian, or Wald (k=3) variances, for example.  In health econometrics this can be accomplished by using a modified Park test (due to Manning and Mullahy).  In this procedure one first computes tentative parameter estimates for a GLM based on one's prior beliefs about the appropriate variance function (I typically use gamma-like regressions for this).  Applying the inverse link function to the linear predictors from the tentative regression gives fitted values on the raw scale, and subtracting these from y gives raw-scale residuals.  The modified Park test is then to regress the squared raw-scale residuals on a constant and the linear predictor in a GLM with a log link; the coefficient on the linear predictor indicates which variance structure is most appropriate. 
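The procedure above can be sketched in Python with numpy (a hand-rolled IRLS fitter stands in for packaged GLM routines, and the simulated gamma data, coefficients, and sample size are all made up for illustration):

```python
import numpy as np

def glm_log_irls(X, y, var_power=2.0, n_iter=100):
    """Fit E[y|X] = exp(X @ b) by IRLS, assuming Var(y|X) ∝ mean**var_power."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())            # start from the overall mean
    for _ in range(n_iter):
        eta = X @ b
        mu = np.exp(eta)
        w = mu**2 / mu**var_power      # log link: working weights = mu^2 / V(mu)
        z = eta + (y - mu) / mu        # working response
        WX = X * w[:, None]
        b = np.linalg.solve(X.T @ WX, WX.T @ z)
    return b

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
mu_true = np.exp(1.0 + 0.5 * x)
y = rng.gamma(2.0, mu_true / 2.0)      # gamma: Var = mu^2 / 2, so true k = 2

# Step 1: tentative gamma-like GLM (log link, variance ∝ mu^2).
b_hat = glm_log_irls(X, y, var_power=2.0)
ln_mu_hat = X @ b_hat                  # linear predictor = ln(fitted mean)

# Step 2: squared raw-scale residuals via the inverse link.
r2 = (y - np.exp(ln_mu_hat)) ** 2

# Step 3: modified Park test -- GLM of r^2 on a constant and ln_mu_hat,
# again with a log link; the slope estimates the variance power k.
Z = np.column_stack([np.ones(n), ln_mu_hat])
k_hat = glm_log_irls(Z, r2, var_power=2.0)[1]
print(k_hat)                           # near 2 for these gamma data
```

In practice one would use a packaged GLM routine (e.g. Stata's glm or statsmodels) rather than this sketch, but the three steps are the same.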

 Now for the question.  In health utilization data one often has a large number of zeros; for example, less than 10% of my sample uses mental health services in any given year.  While GLMs are typically well behaved, in the presence of so many zeros this need not be the case.  One common practice is then to use a "two part" model: an initial probit or logit regression estimates the probability of any utilization, and the second-stage GLM is estimated among users only.  My question relates to the appropriate sample to use for the modified Park test--users or everybody?  It turns out that in this case it matters, since when I look at everyone I get evidence in support of gamma-like regressions (i.e. k=2 in my Park test), but when I only consider users in the Park test I get estimates of k=2.6 or so, which is more consistent with Wald-type variances. 
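For concreteness, here is a minimal two-part sketch (logit first part, gamma-like GLM second part, both fit with a hand-rolled IRLS; the data-generating process and all numbers are invented for illustration):

```python
import numpy as np

def irls(X, y, kind, n_iter=100):
    """IRLS for a logit model (kind='logit') or a log-link GLM with
    variance proportional to mu^2 (kind='gamma')."""
    b = np.zeros(X.shape[1])
    if kind == "gamma":
        b[0] = np.log(y.mean())        # start from the overall mean
    for _ in range(n_iter):
        eta = X @ b
        if kind == "logit":
            p = 1.0 / (1.0 + np.exp(-eta))
            w = np.clip(p * (1.0 - p), 1e-9, None)
            z = eta + (y - p) / w
        else:                          # log link, gamma-type variance: w = 1
            mu = np.exp(eta)
            w = np.ones_like(mu)
            z = eta + (y - mu) / mu
        WX = X * w[:, None]
        b = np.linalg.solve(X.T @ WX, WX.T @ z)
    return b

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Part 1: who uses any care at all (many zeros, as in utilization data).
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * x)))
d = rng.uniform(size=n) < p_true

# Part 2: spending among users only, gamma with a log-link mean.
mu_true = np.exp(0.5 + 0.5 * x)
y = np.where(d, rng.gamma(2.0, mu_true / 2.0), 0.0)

b_logit = irls(X, d.astype(float), "logit")
b_gamma = irls(X[d], y[d], "gamma")    # second part: users only

# Combined prediction: E[y|x] = Pr(y > 0 | x) * E[y | x, y > 0].
p_hat = 1.0 / (1.0 + np.exp(-(X @ b_logit)))
y_hat = p_hat * np.exp(X @ b_gamma)
```

The question in the text is then whether step 3 of the modified Park test should use the residuals and fitted values from `X[d], y[d]` (users) or from the full `X, y`.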

 My strong suspicion is that the latter approach is more appropriate, since the GLM is estimated among users only, but I've hunted through the literature and found no specific advice on this point--and many examples that seem to indicate the test should be done on everybody. 

 * One exception is the Extended Estimating Equations method proposed by Basu and Rathouz (implemented as pglm in Stata).