When comparing how the set of variables passed to the svyby function affects the resulting estimates and standard errors, I found that passing a single variable and passing two variables yield the same estimates, but passing several variables yields substantially lower estimates than the other two calls.
What is the reason for this, and how can I avoid it?
Link to the dataset: https://drive.google.com/open?id=1xqFxUBLZifaz57yvoNFOcvhBDGuHuSMq
Here is my code:
library(tidyverse)
library(survey)
load("des2004small.RData")
weighUp <- function(variables) {
  svyby(formula = make.formula(variables), by = ~statefip,
        design = des2004small,
        FUN = svytotal, na.rm = TRUE)
}
# Weigh up a single variable:
dfstate2004_singleVariable = weighUp(c("race_acs"))
# Weigh up two variables:
dfstate2004_twoVariables = weighUp(c("race_acs", "cvap_acs"))
# Weigh up multiple variables:
dfstate2004_multipleVariables = weighUp(c("race_acs", "cit_acs",
  "educ_acs", "unemployed_acs", "labforce_acs", "poverty_acs", "cvap_acs"))
# Compare the three different methods:
comparison2004 = dfstate2004_singleVariable %>%
  inner_join(dfstate2004_twoVariables, by = "statefip", suffix = c(".single", ".two")) %>%
  inner_join(dfstate2004_multipleVariables, by = "statefip", suffix = c("", ".multiple"))
race_acswhite2004 = comparison2004 %>%
  select(statefip,
         single = race_acswhite.single,
         two = race_acswhite.two,
         multiple = race_acswhite)
race_acswhite2004
Here are the resulting estimates, which differ:
  statefip  single     two multiple
1        1 3084123 3084123  2128346
2        2  427008  427008   277075
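The variables in the 'multiple' call have missing values, and svytotal drops any observation that has a missing value on any of the variables it is analysing. By default it returns NA results, but if you ask it to drop missing values with na.rm=TRUE, it drops the whole observation, not just the missing value. The estimate for race_acs is therefore computed from fewer observations whenever another variable in the same call is missing.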
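One way to avoid this, sketched below under stated assumptions, is to call svyby once per variable so that a missing value in one variable cannot remove observations from the totals of another. The toy data frame, its columns (race, educ, wt), and the helper weighUpSeparately are hypothetical and only illustrate the idea; they are not taken from the linked dataset.

```r
library(survey)

# Hypothetical toy data: respondent 2 has a missing value for educ only.
toy <- data.frame(
  statefip = c(1, 1, 1, 2, 2, 2),
  race     = factor(c("white", "white", "black", "white", "black", "white")),
  educ     = factor(c("hs", NA, "college", "hs", "college", "hs")),
  wt       = c(10, 20, 30, 40, 50, 60)
)
des <- svydesign(ids = ~1, weights = ~wt, data = toy)

# Joint call: respondent 2 is dropped for every variable, including race.
joint <- svyby(~race + educ, by = ~statefip, design = des,
               FUN = svytotal, na.rm = TRUE)

# Hypothetical helper: one svyby call per variable, merged by state afterwards.
weighUpSeparately <- function(variables, design) {
  pieces <- lapply(variables, function(v) {
    as.data.frame(svyby(make.formula(v), by = ~statefip, design = design,
                        FUN = svytotal, na.rm = TRUE))
  })
  Reduce(function(a, b) merge(a, b, by = "statefip"), pieces)
}

separate <- weighUpSeparately(c("race", "educ"), des)

# Compare the weighted total of white respondents in state 1.
joint$racewhite[joint$statefip == 1]
separate$racewhite[separate$statefip == 1]
```

In this sketch the separate call keeps respondent 2 in the race totals, while the joint call does not. The trade-off is that each variable's totals are then based on a different set of observations. If the totals must be mutually consistent, an alternative is to recode the missing values as an explicit factor level before building the design.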