Search code examples
rsurveyconfidence-interval

How can I get CIs around row proportions in the "survey" package?


[Edited 1/4 to specify that CIs should be bounded by 0 and 1]

The Goal

I have survey data using a complex survey design (from the California Health Interview Survey, CHIS) and am trying to use the survey package to obtain estimates of proportions for each of four sexual orientation categories (straight, gay/lesbian, bisexual, and other) in all 58 counties in California using the survey package.

So, for example, in Alameda County, 92% of its residents might be straight; 3.5%, gay or lesbian; 3.5%, bisexual; and 1%, "other." The goal is to get a confidence interval around each of those proportions (e.g., 92% [91.3%, 92.7%]), for all counties.

Note: Since we're dealing with probabilities, the confidence limits should neither exceed 1 nor fall below 0.

Reprex

Here's a reproducible example:

#Packages
require(survey)
require(dplyr)
require(tidyverse)

#Create data frame
# Set the number of trials and the probabilities for each outcome
set.seed(130)
probs <- c(.94, .03, .02, .01)
n <- 2000

sogi <- data.frame(t(rmultinom(n, 1, prob = probs)))
names(sogi) <- c("straight","gay","bisexual","other")
df_sogi <- sogi %>% mutate(sexual_orientation = 
                             as.factor(case_when(
                               straight %in% 1 ~ 0,
                               gay %in% 1 ~ 1,
                               bisexual %in% 1 ~ 2,
                               other %in% 1 ~ 3))) %>%
  mutate(pw = runif(n, 0.2, 3.5)) %>%
  mutate(county = as.factor(rep(LETTERS[1:10], each = n/length(LETTERS[1:10])))) %>%
  mutate(ind = row_number()) %>%
  select(ind, county, sexual_orientation, pw) 

#Examine raw frequencies
table(df_sogi$county, df_sogi$sexual_orientation)

#Make survey design and survey design replicate objects
df_svy_z1 <- svydesign(id=~ind, data = df_sogi, weights = ~pw, strata = ~county)

Solution Attempt #1

Here's the closest I've gotten, which uses svyglm:

#w/ svyglm
s <- svyglm(sexual_orientation == 0 ~ county, design = df_svy_z1, family="quasibinomial")
a <- predict(s, newdata=df_sogi, type="link", se.fit=TRUE)
b <- unique(confint(a))
exp(b)/(1+exp(b))

> exp(b)/(1+exp(b))
         2.5 %    97.5 %
1    0.8525092 0.9479540
201  0.8641069 0.9584830
401  0.8766525 0.9607483
601  0.8552640 0.9504926
801  0.8743262 0.9592262
1001 0.9143716 0.9814168
1201 0.8587895 0.9514747
1401 0.9316797 0.9862013
1601 0.8660052 0.9532556
1801 0.8633378 0.9532417

I would then write a function that loops over each of the values of the sexual_orientation variable {0, 1, 2, 3}--not an easy task, given that (for me, anyway) invoking named objects and their values is devilishly difficult in survey (or maybe just R generally).

More vexingly yet, I'm not at all sure that the confidence intervals obtained by this method are correct; they seem way too wide.

Solution Attempt #2

An alternative strategy could use svymean:

s <- svymean(~interaction(sexual_orientation, county), design=df_svy_z1)
x <- svycontrast(s, quote(log(`interaction(sexual_orientation, county)0.A` /
                                  (1-`interaction(sexual_orientation, county)0.A`))))
y <- confint(x)
z <- 1/(1+exp(-y))

The two big problems here are that (1) the proportions are cell proportions, not row proportions--i.e., frequencies expressed as a proportion of the entire sample, not of the sample within a given county--and (2) looping over the named objects involves replacing the expression quote(log(`interaction(sexual_orientation, county)0.A`/(1-`interaction(sexual_orientation, county)0.A`)))) with a more general expression for each of the 40 named values in the object s, which I haven't figured out how to do yet.

Similar, but Different, Post

My question is similar to this one:
Is there a way to estimate multinomial proportions with confidence intervals using R survey package?

But the answers there, graciously provided by Prof. @Thomas Lumley, only deal with marginal probabilities--that is, with proportions of straight, gay/lesbian, etc., individuals for the entire state, and not for each county.


Solution

  • There may be a more elegant solution, but you can try using svyby here to first get the means by county:

    library(survey)
    
    svmeans <- svyby(~ sexual_orientation, 
                     design = df_svy_z1, 
                     by = ~ county, 
                     FUN = svymean)
    
    #   county sexual_orientation0 sexual_orientation1 sexual_orientation2 sexual_orientation3 se.sexual_orientation0 se.sexual_orientation1
    # A      A           0.9398657          0.02097910         0.023836687         0.015318555             0.01794317             0.01066598
    # B      B           0.9433422          0.02165069         0.033570977         0.001436098             0.01845545             0.01154185
    # C      C           0.9327511          0.03279607         0.022840837         0.011612004             0.01966055             0.01504690
    # ....
    

    Then if you are ok with a standard Wald CI using the standard error, wrap that in confint()

    svCIs <- confint(svmeans)
    
    #                               2.5 %      97.5 %
    # A:sexual_orientation0  9.046977e-01 0.975033618
    # B:sexual_orientation0  9.071702e-01 0.979514244
    # C:sexual_orientation0  8.942171e-01 0.971285064
    # D:sexual_orientation0  8.881002e-01 0.969468334
    # ....
    

    The returned objects are a bit unwieldy, so to make sense of it all, you can pretty up the objects and merge them:

    library(dplyr)
    library(tidyr)
    
    svmeans_long <- pivot_longer(svmeans[,-grep("se\\.", names(svmeans))], -county, 
                                 names_to = "sexual_orientation",
                                 values_to = "mean")
    
    svCIs_long <- data.frame(svCIs) %>%
      tibble::rownames_to_column() %>%
      separate_wider_delim(rowname, delim = ":", 
                           names = c("county", "sexual_orientation"))
    
    final_data <- left_join(svmeans_long, svCIs_long)
    
    #   county sexual_orientation     mean     X2.5.. X97.5..
    #   <chr>  <chr>                 <dbl>      <dbl>   <dbl>
    # 1 A      sexual_orientation0 0.940    0.905     0.975  
    # 2 A      sexual_orientation1 0.0210   0.0000742 0.0419 
    # 3 A      sexual_orientation2 0.0238   0.00390   0.0438 
    # 4 A      sexual_orientation3 0.0153  -0.00590   0.0365 
    # 5 B      sexual_orientation0 0.943    0.907     0.980  
    # ....
    

    EDIT

    Based on the comments/edit, if the proportions are close to 0 or 1, the Wald CI may not be appropriate. In this case you can still use svyby but use svyciprop for each level of sexual_orientation. The function svyciprop allows for various types of confidence intervals to be calculated.

    Here is an example looping through all the levels of sexual_orientation and using the xlogit CI option):

    svList <- list()
    
    for(xx in sort(unique(df_sogi$sexual_orientation))){
      svList[[xx]]  <- svyby(~ I(sexual_orientation == xx),
            design = df_svy_z1,
            by = ~ county,
            FUN = svyciprop,
            vartype = "ci",
            method = "xlogit") # or "beta" or "logit" or "likelihood", etc see `?svyciprop`
      # tweak the output for convenience
      svList[[xx]]$sexual_orientation <- xx
      names(svList[[xx]])[2] <- "mean"
    }
    
    # combine it all together and order according to county
    do.call(rbind, svList)[order(do.call(rbind, svList)$county),]
    
    #     county        mean         ci_l       ci_u sexual_orientation
    # 0.A      A 0.911193691 0.8520671385 0.94812694                  0
    # 1.A      A 0.036149402 0.0149796639 0.08466537                  1
    # 2.A      A 0.042363554 0.0196451241 0.08897020                  2
    # 3.A      A 0.010293353 0.0018801543 0.05430515                  3
    # 0.B      B 0.923758035 0.8636429959 0.95863968                  0
    # 1.B      B 0.036042649 0.0141337843 0.08885182                  1
    # 2.B      B 0.038706656 0.0164219903 0.08851038                  2
    # 3.B      B 0.001492660 0.0002056946 0.01074521                  3
    # ....