Search code examples
rsurvey

Using the survey package to find SE's and crosstabulations


I am using the survey package by Thomas Lumley to create cross tabs and SE's. I am struggling to specify the denominator of the cross tabulation.

This is my data:

    library(survey)
    data <- read_table2("Q50_1   Q50_2   Q38 Q90 pov gender  wgt id
    yes   3   Yes NA   High    M   1.3 A
    NA   4   No  2   Med F   0.4 B
    no   2   NA 4   Low F   1.2 C
    maybe   3   No  2   High    M   0.5 D
    yes   NA   No  NA   High    M   0.7 E
    no   2   Yes 3   Low F   0.56 F
    maybe   4   Yes 2   Med F   0.9 G")
    
    

Create the design object:

    design <- svydesign(id =~id,
                            weights  = ~wgt,
                            nest = FALSE,
                            data = data)

To find the cross tabulation of Q50_1 by Female:

svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)

This gives me:

                                                 mean     SE
interaction(Q50_1, gender == "F")maybe.FALSE 0.096899 0.1043
interaction(Q50_1, gender == "F")no.FALSE    0.000000 0.0000
interaction(Q50_1, gender == "F")yes.FALSE   0.387597 0.2331
interaction(Q50_1, gender == "F")maybe.TRUE  0.174419 0.1725
interaction(Q50_1, gender == "F")no.TRUE     0.341085 0.2233
interaction(Q50_1, gender == "F")yes.TRUE    0.000000 0.0000

This is not as useful to me because the denominator includes TRUE FALSE values for every combination, whereas I am only interested in the mean that is true. So, I could easily find the percentage of TRUE as follows:

dat <- as.data.frame(svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)) %>% tibble::rownames_to_column("question")

dat %>%   tidyr::separate(question,c("question",'response'), sep = "\\)", extra = "merge") %>%
    mutate(question = str_replace(question,"interaction\\("," ")) %>%
    tidyr::separate(response,c('value', 'bool'), sep ="\\." ) %>% 
    tidyr::separate(question,c('question', 'group'), sep ="\\," ) %>% 
    tidyr::separate(group,c('group_level', 'group'), sep ="\\==" ) %>% 
    
    filter(bool=='TRUE') %>%
    group_by(question, group_level, group) %>%
    mutate(sum_true = sum(mean)) %>%
    mutate(mean= mean/sum_true)

This gives me:

  question group_level group    value bool   mean    SE sum_true
  <chr>    <chr>       <chr>    <chr> <chr> <dbl> <dbl>    <dbl>
 " Q50_1" " gender "  " \"F\"" maybe TRUE  0.338 0.173    0.516
 " Q50_1" " gender "  " \"F\"" no    TRUE  0.662 0.223    0.516
 " Q50_1" " gender "  " \"F\"" yes   TRUE  0     0        0.516

The means are exactly what I want, but the SEs are associated with a different denominator and don't represent the manipulated mean. Is there a way to call the svymean to present the mean and SE of ONLY the TRUE values in the denominator?

I thought something like this might do (but it does not work):

svymean(~interaction(Q50_1,gender=="F"[TRUE]), design, na.rm = T)

My desired outcome (the SE's are fake):

                                                      mean     SE
interaction(Q50_1, gender == "F"[TRUE])maybe.TRUE  0.338     0.0725
interaction(Q50_1, gender == "F"[TRUE])no.TRUE     0.0.662   0.0233
interaction(Q50_1, gender == "F"[TRUE])yes.TRUE    0.0       0.0000

Solution

  • To get the percentage of women who responded to each answer you want

    svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
    

    or equivalently (because that's how svyby does it)

    svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)
    

    If you want to get the empty category as well, you need to convert the ~Q50_1 variable to a factor -- that's the point of factors (vs strings): they know what levels they have.

    If you to be able to extract parts of the output programmatically, use the coef and SE functions

    data$Q50_1<-factor(data$Q50_1)
    design <- svydesign(id =~id,
                                 weights  = ~wgt,
                                 nest = FALSE,
                                 data = data)
                                 
     svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
     svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)[1,]
    
     coef(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)) 
     SE(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE))                             
    

    These don't agree with what you got using ~interaction, because what you got that way doesn't match what you said you want. The interaction analysis gives you the percentage of people who are women and also responded yes, not the percentage of yes responses among women. To put it another way, the 6 percentages you get with the interaction analysis add to 100%, not to 200%.

    > sum(coef(svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)))
    [1] 1