
The set of variables used for weighing-up changes the resulting estimates


While comparing how the set of variables passed to the svyby function affects the resulting estimates and standard errors, I found that weighing up a single variable and weighing up two variables yield the same estimates, but weighing up a longer list of variables yields a much lower estimate for the same column.

Why does this happen, and how can I prevent it?

Link to the dataset: https://drive.google.com/open?id=1xqFxUBLZifaz57yvoNFOcvhBDGuHuSMq

Here is my code:

library(tidyverse)
library(survey)

load("des2004small.RData")

weighUp <- function(variables) {
  svyby(formula = make.formula(variables), by = ~statefip, 
        design = des2004small,  
        FUN = svytotal, na.rm = TRUE)
}

# Weigh up a single variable:
dfstate2004_singleVariable = weighUp(c("race_acs"))
# Weigh up two variables:
dfstate2004_twoVariables = weighUp(c("race_acs", "cvap_acs"))
# Weigh up multiple variables:
dfstate2004_multipleVariables = weighUp(c("race_acs", "cit_acs", 
                                          "educ_acs", "unemployed_acs", "labforce_acs", "poverty_acs", "cvap_acs"))

# Compare the three different methods:
comparison2004 = dfstate2004_singleVariable %>% 
  inner_join(dfstate2004_twoVariables, by = "statefip", suffix = c(".single", ".two")) %>%
  inner_join(dfstate2004_multipleVariables, by = "statefip", suffix = c("", ".multiple"))

race_acswhite2004 = comparison2004 %>% 
  select(statefip, 
         single = race_acswhite.single, 
         two = race_acswhite.two, 
         multiple = race_acswhite)
race_acswhite2004

Here are the resulting differing estimates:

+-------------------------------------+
|   statefip  single     two multiple |
+-------------------------------------+
| 1        1 3084123 3084123  2128346 |
| 2        2  427008  427008   277075 |
+-------------------------------------+

Solution

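  • The variables in the 'multiple' table have missing values. By default svytotal returns NA when a variable it is analysing has missing values. If you ask it to drop them with na.rm=TRUE, it drops every observation that is missing on any of the variables in the formula, so the whole observation is lost for all of the variables, not only for the one that is missing. The longer the list of variables in one call, the more observations are dropped, and the totals shrink accordingly.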
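
One way to avoid this is to weigh up each variable in its own svyby call, so that a missing value in one variable cannot remove observations from another. The following is only a minimal sketch on made-up data: the data frame toy, its columns white, employed and wt, and the equal weights are assumptions for illustration and are not taken from des2004small.

```r
library(survey)

# Hypothetical toy data: six respondents in two states, equal weights.
# 'employed' has missing values, 'white' has none.
toy <- data.frame(
  statefip = c(1, 1, 1, 2, 2, 2),
  white    = c(1, 1, 1, 1, 1, 1),
  employed = c(1, NA, 1, 1, 1, NA),
  wt       = rep(10, 6)
)
des <- svydesign(ids = ~1, weights = ~wt, data = toy)

# Joint call: an observation with NA in 'employed' is dropped for
# 'white' as well, so the total for 'white' is too low.
joint <- svyby(~white + employed, by = ~statefip, design = des,
               FUN = svytotal, na.rm = TRUE)

# Separate calls: each variable only loses the observations that are
# missing on that variable itself.
vars <- c("white", "employed")
separate <- lapply(vars, function(v) {
  svyby(make.formula(v), by = ~statefip, design = des,
        FUN = svytotal, na.rm = TRUE)
})
names(separate) <- vars

# Compare the totals for 'white' from the two approaches.
data.frame(statefip = joint$statefip,
           joint    = joint$white,
           separate = separate$white$white)
```

Applied to the question's code, this would mean calling weighUp once per variable name (for example with lapply over the vector of names) rather than once with all of the names. The trade-off is that the totals for different variables are then based on different sets of observations, so they no longer describe one common complete-case sample.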