How can I get my R formula to correctly parse a true/false statement as one of its arguments?


I have a fictional weighted survey dataset that shows how responses to the question "I enjoy driving fast" vary by respondents' car colors. Here's a sample of the original dataset:

Car_Color   Weight  Enjoy_Driving_Fast
White   0.0002849   Slightly Disagree
Red     0.0010247   Slightly Disagree
Black   0.0046459   Strongly Agree
Red     0.0048461   Strongly Agree
Red     0.0060173   Strongly Agree
Black   0.0062723   Agree
Red     0.0083730   Strongly Agree
Black   0.0115573   Strongly Agree
Black   0.0131331   Strongly Agree
White   0.0156400   Strongly Agree
White   0.0201834   Slightly Agree
White   0.0209492   Strongly Disagree

And here's a copy of my code that imports this dataset, then converts it to a survey design object:

library(tidyverse)
library(survey)
library(srvyr)
library(fastDummies)

df_car_survey <- read_csv(
'https://raw.githubusercontent.com/kburchfiel/car_survey_data/refs/heads/main/car_survey.csv') 
car_survey_des <- df_car_survey %>% as_survey_design(
  weights = 'Weight')

I am working on a post-hoc chi-squared test that will determine whether the proportion of red car owners who agreed with this statement differs from the corresponding proportion of white car owners. Because my data is stored within a survey design object, this test will be conducted using the survey library's svychisq() function.

I tried to run this chi-squared test using the following code (which is based on Chapter 6 of Exploring Complex Survey Data Analysis Using R):

chi2_car_color_agreement_red_white_agree <- car_survey_des %>%
  filter(Car_Color %in% c("Red", "White")) %>%
  drop_na(Car_Color) %>%
  svychisq(
    formula = ~ Car_Color + (Enjoy_Driving_Fast == "Agree"),
    design = .,
    statistic = "Chisq",
    na.rm = TRUE
  )

However, I received the following error:

Error in `[.data.frame`(design$variables, , as.character(cols)) : 
  undefined columns selected

I think the issue here is with the (Enjoy_Driving_Fast == "Agree") component of the formula. Is there a way to modify that component in order to make it compatible with R's formula logic?

I was able to get around this issue by creating a dummy variable that indicates whether or not the respondent chose 'Agree' as their response to the "I enjoy driving fast" question, then passing that variable to the formula in place of (Enjoy_Driving_Fast == "Agree"). Nevertheless, I would like to find a way to get the original formula to work so that I can skip the dummy variable creation step.


Solution

  • Judging from the source code of svychisq and svyttest, svychisq appears unable to handle a term such as (Enjoy_Driving_Fast == "Agree").

    Using the anes_2020 dataset as described in the book to avoid confounding issues, we can see what happens when we step through their code:

    renv::install("tidy-survey-r/srvyrexploR")
    library(dplyr)
    library(tidyr)
    library(survey)
    library(srvyr)
    library(srvyrexploR)
    
    data("anes_2020")
    targetpop <- 231592693
    
    anes_adjwgt <- anes_2020 %>%
      mutate(Weight = Weight / sum(Weight) * targetpop)
    
    anes_des <- anes_adjwgt %>%
      as_survey_design(
        weights = Weight,
        strata = Stratum,
        ids = VarUnit,
        nest = TRUE
      )
    # this works as expected
    anes_des %>%
      svychisq(
        formula = ~ TrustGovernment + TrustPeople,
        design = .,
        statistic = "Wald",
        na.rm = TRUE
      )
    

    We can break things in the same way:

    anes_des %>%
      svychisq(
        formula = ~ TrustGovernment + (TrustPeople=="Some of the time"),
        design = .,
        statistic = "Wald",
        na.rm = TRUE
      )
    Error in `[.data.frame`(design$variables, , as.character(cols)) : 
      undefined columns selected
    

    Now on to the source code, from surveychisq.R:

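    svychisq.survey.design<-function(formula, design,
                       statistic=c("F","Chisq","Wald","adjWald","lincom","saddlepoint","wls-score"),
                       na.rm=TRUE,...){
    
    # ... (code omitted)
    # third element of the right-hand side, i.e. the term after `+`
    cols<-formula[[2]][[3]]
    # ... (code omitted)
    # the term is converted to text and used as a column name
    colvar<-unique(design$variables[,as.character(cols)])
    # ... (code omitted)
    
    }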
    

    When formula = ~ TrustGovernment + TrustPeople:

    form = as.formula(~ TrustGovernment + TrustPeople, env = anes_des)
    colvar <- as.character(form[[2]][[3]])
    colvar
    [1] "TrustPeople"
    anes_des$variables[1:5 , colvar]
    # A tibble: 5 × 1
      TrustPeople        
      <fct>              
    1 About half the time
    2 Some of the time   
    3 Some of the time   
    4 Most of the time   
    5 Some of the time   
    

    svychisq simply converts the two terms of the formula (the second and third elements of the right-hand side, since + is the first) to character strings and uses each one to select a column from design$variables with base R column subsetting, df[ , "col"]. Returning to the formula that breaks the function:

    form = as.formula(~ TrustGovernment + (TrustPeople=="Some of the time"), env = anes_des)
    colvar <- as.character(form[[2]][[3]])
    colvar
    [1] "("                                   "TrustPeople == \"Some of the time\""
    

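    We actually get a length-2 character vector, because the term is now a call to ( and as.character() returns the function name followed by the deparsed argument. Neither element is a column in design$variables.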

    anes_des$variables[1:5 , colvar]
    Error in `anes_des$variables[1:5, colvar]`:
    ! Can't subset columns that don't exist.
    ✖ Columns `(` and `TrustPeople == "Some of the time"` don't exist.
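
    The same behaviour can be reproduced without any survey object. The following base R sketch (the names simple, wrapped, rhs_simple and col_term are only illustrative) shows how each formula is stored and what as.character() returns for the second term:

    ```r
    # A one-sided formula has two elements: `~` and the right-hand side.
    simple <- ~ TrustGovernment + TrustPeople
    rhs_simple <- simple[[2]]            # the call `+`(TrustGovernment, TrustPeople)

    as.character(rhs_simple[[1]])        # "+"
    as.character(rhs_simple[[2]])        # "TrustGovernment"
    as.character(rhs_simple[[3]])        # "TrustPeople"

    # With a parenthesised comparison, the second term is a call, not a name.
    wrapped <- ~ TrustGovernment + (TrustPeople == "Some of the time")
    col_term <- wrapped[[2]][[3]]

    is.call(col_term)                    # TRUE: a call to `(`
    col_names <- as.character(col_term)  # function name, then the deparsed argument
    length(col_names)                    # 2
    col_names[1]                         # "("
    ```

    Because a call converts to one string per element, the result can never be a single valid column name, which is why the subsetting step fails.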
    

    So why does it work in svyttest? Because the author implemented that function quite differently. Note the use of eval(bquote( in svyttest.R:

    eval(bquote(# blah)
    

    R's non-standard evaluation is complicated enough to write a book on, so details are way outside the scope of this question.
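
    As a rough illustration of the difference (a sketch of the general pattern only, not the actual body of svyttest), the toy data frame below uses a few rows from the question's sample. Looking the term up as a column name fails, whereas evaluating it against the data, or splicing it into a call with bquote() and then evaluating that call, works. The names df, term, lookup, flag and tab are made up for this example:

    ```r
    # Toy data: five rows taken from the sample shown in the question.
    df <- data.frame(
      Car_Color = c("White", "Red", "Black", "Red", "Black"),
      Enjoy_Driving_Fast = c("Slightly Disagree", "Slightly Disagree",
                             "Strongly Agree", "Strongly Agree", "Agree"),
      stringsAsFactors = FALSE
    )

    form <- ~ Car_Color + (Enjoy_Driving_Fast == "Agree")
    term <- form[[2]][[3]]               # the parenthesised comparison, as a call

    # Approach 1: treat the term as a column name (as in the svychisq excerpt).
    # This fails, so the error message is captured instead of a column.
    lookup <- tryCatch(
      df[, as.character(term)],
      error = function(e) conditionMessage(e)
    )

    # Approach 2: evaluate the term with the data frame as its environment.
    flag <- eval(term, envir = df)       # logical vector, one value per row

    # Approach 3: splice the term into a larger call with bquote(), then eval.
    tab_call <- bquote(table(Car_Color, .(term)))
    tab <- eval(tab_call, envir = df)    # Car_Color cross-tabulated with the flag
    ```

    Approaches 2 and 3 treat the term as code to be run against the data, so any expression that is valid inside the data frame can appear in the formula.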

    See the R documentation for bquote and eval.