Search code examples
rstatisticscluster-analysissurvey

R survey poststratification: struggling with the survey function


I'm new here and new to R. I'm wondering whether I've used the R survey package correctly to postStratify my data. Below you can see the data structure of my dataframe (df).

utype gender age regzeit finanz sfeld sindex
pri female 23 ja s ARG 5
sta male 23 nein f ARG -7
sta female 21 ja ARG 11
pri male 28 ja t ARG 1

I've oversampled females for the "gender" variable and students for the "utype" variable and now want to adjust for the distribution in the population. My n=383 was oversampled to n = 477 The intended distributions of my my n=383 sample are:

utype male female Sum
pri 54 68 122
sta 128 133 261
Sum 187 196 383

design <- svydesign(id = ~utype+gender, data = df)

Warning message: In svydesign.default(id = ~utype + gender, data = df) : No weights or probabilities supplied, assuming equal probability

pop.types <- data.frame(utype=c("sta","pri"), Freq=c(261,122))

designp <- postStratify(design, ~utype, pop.types)

postStratify(design, ~utype, pop.types)

svymean(~sindex, design)
mean | SE
sindex 0.48008 | 0.0192
svymean(~sindex, designp)
mean | SE
sindex 0.47692 | 0

My question now is whether the below code is correct and how I can postStratify for both variables utype and gender in my code or whether I have to run the postStratify command twice. I'm especially concerned that the standard error is zero in my weighted sample and because of the warning message. And whether the Freq values are correct for what I'm trying to do here?

The last thing I've been trying to figure out is how to get the svymean, svyhist or svyboxplot functions for "sindex", but only for the observations with utype == pri, so by group basically. This should all be applied to the weighted sindex values.

I hope I'm conforming to all the rules. Many thanks in advance!


Solution

  • You don't want to post-stratify twice (that would give you raking). You want to post-stratify once, using a post-stratum variable that is a combination of your two variables gender, as in your 2x2 table. That is,

    designp <- update(designp, combined_var = interaction(gender,utype))
    

    You now specify a pop.types= argument that has the desired frequencies for each of the four levels of this new variable.

    With only the four observations you list, you will end up with zero standard errors because there isn't any variation within any of the post-strata.