Search code examples
rsurveyvariance

Using "adjust" option for lonely PSUs in R survey package: why don't single-PSU strata variances change when data in other strata change?


I have survey data from a stratified simple random sampling design where some strata contain only a single sampling unit (even though the strata population size may be >1). These are referred to as "lonely PSUs" in the R survey package (http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html). There are several options for dealing with this situation and the one I'm interested in is the "adjust" option.

The documentation for options(survey.lonely.psu="adjust") states that

"the data for the single-PSU stratum are centered at the sample grand mean rather than the stratum mean."

In experimenting with this option I had expected that the variance for my single-PSU stratum would change if I changed the data in another stratum but that wasn't the case. Here is a small example:

library(survey)
options(survey.lonely.psu="adjust")

# sample 1
dat1 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(2, 6, 15))
survey1 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat1)
svyby(~y, by = ~h, design = survey1, FUN = svytotal)

In the results note the standard error for strata 2, which is the single-PSU stratum:

  h  y        se
1 1 12  3.464102
2 2 30 21.213203

Now if I change the data in strata 1 like so

# sample 2
(dat2 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(200, 600, 15)))
(survey2 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat2))
svyby(~y, by = ~h, design = survey2, FUN = svytotal)

The results change as expected for stratum 1, but the standard error is still the same for stratum 2

  h    y       se
1 1 1200 346.4102
2 2   30  21.2132

Am I mis-interpreting what the documentation means or might this be a bug?

FYI, here is my sessionInfo:

R version 3.1.3 (2015-03-09)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
other attached packages:
[1] survey_3.30-3

EDIT

My interpretation of the initial answers I received was that the variance adjustment doesn't take effect when the data are subset using the svyby function. However, when I compare the variances of the totals for the stratified population with the variance for the total population it appears that, just as when there are no lonely PSUs, the variance of the total is simply the variance of the independently sampled stratum variances:

> vcov(svyby(~y, by = ~h, design = survey1, FUN = svytotal))
   1   2
1 12   0
2  0 450

> vcov(svytotal(~y, survey1))
    y
y 462

It seems that the latter variance should be different if there is some sort of centering to the grand mean going on when all the data are combined.

As a related question, here is a comparison of svyby when estimating means versus totals:

> svyby(~y, by = ~h, design = survey1, FUN = svymean)
  h  y    se
1 1  4 1.155
2 2 15 0.000

> svyby(~y, by = ~h, design = survey1, FUN = svytotal)
  h  y     se
1 1 12  3.464
2 2 30 21.213

I'm confused here as to why there is a variance estimated for stratum 2 (which contains a lonley PSU) when estimating a total but not when estimating a mean.


Solution

  • Short Answer

    TL;DR: The survey package currently has a bug with the 'adjust' option when using svytotal() but works fine for svymean(). This blog post gets into the details: https://www.practicalsignificance.com/posts/bugs-with-singleton-strata/

    Long Answer

    Currently, with the 'adjust' option, the survey package is just centering the lonely PSU around 0 and then adding the squares to the estimated covariance matrix. This is totally reasonable for svymean(), svyratio(), and many other functions, but it is a mistake for svytotal().

    Why?

    Background on how the survey package estimates variances using influence functions

    The R package estimates variances using the method of linearization based on influence functions. This blog post explains what that means and gives a few references:

    https://www.practicalsignificance.com/posts/survey-covariances-using-influence-functions/#how-does-this-work

    Essentially, for a statistic of interest, we calculate its variance by expressing it as a weighted total of influence functions and estimating the variance of that weighted total. For statistics such as means or ratios, the weighted total of the influence functions is always 0. Totals are a special case, where the influence functions are equal to the original variable, and so their sum is generally nonzero.

    What survey.lonely.psu = 'adjust' is supposed to do

    The documentation for the "adjust" option says that it works by "recentering" the lonely PSU. If the PSU wasn't lonely, we could get a reasonable estimate of its variance contribution by calculating the squared difference between the PSU's total and the average PSU total in its stratum. But because it's the only PSU in its stratum, that difference is zero. Theoretically, with survey.lonely.psu = 'adjust', we instead estimate its variance contribution by calculating the squared difference between the PSU's total and average PSU total across all strata.

    What the survey package actually does and why that's a problem for svytotal()

    Because the survey package uses the method of linearization based on influence functions, the average PSU total across all strata is zero for every statistic other than totals. I think because of this fact, the survey package was written so that when the user specifies survey.lonely.psu = 'adjust', the variance contribution of the lonely PSU is taken by just squaring the PSU total (i.e. subtracting zero). That is clever and works for means, ratios, etc. But it fails for totals because of the unique property that the total of their influence functions is nonzero.

    https://www.practicalsignificance.com/posts/bugs-with-singleton-strata/