I am trying to identify the best way to run a one-way Anova on a complex survey design. After perusing Lumley's Survey package documentation, I am none the wiser.
The survey::anova function is meant to 'Fit and compare hierarchical loglinear models for complex survey data', which is not what I am doing.
What I am trying to do I have collected data about one categorical independent variable [3 levels] and one quantitative dependent variable. I want to use ANOVA to check if the dependent variable changes according to the level of the independent variable.
Here is an example of my process:
Load Survey package and create complex survey design object
library(survey)
df <- data.frame(sex = c('F', 'O', NA, 'M', 'M', 'O', 'F', 'F'),
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
svy_design <- svydesign(ids=~1, data=df, weights=~weight)
Borrowing from this post over here,
summary(aov(weight~sex,data = svy_design))
However I got an error saying:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'api00' not foun
That same post has an answer/explanation with a case against using anova:
According to the main statistician of our institute there is no easy implementation of this kind of analysis in any common modeling environment. The reason for that is that ANOVA and ANCOVA are linear models that where not further developed after the emergence of General Linear Models (later Generalized linear models - GLMs) in the 70's. A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(weight~sex,svy_design))
I got this output:
call:
svyglm(formula = weight ~ sex, design = svy_design)
Survey design:
svydesign(ids = ~1, data = df, weights = ~weight)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8730 0.1478 5.905 0.00412 **
sexM -0.3756 0.1855 -2.024 0.11292
sexO -0.4174 0.1788 -2.334 0.07989 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.04270091)
Number of Fisher Scoring iterations: 2
I am a far cry from a stats buff, please do explain in the simplest possible terms. Thank you!!
There is no such function as survey::aov
, so you can't use it to accomplish your goal. Your code uses stats::aov
You can use survey::svyglm
. I will use one of the examples from the package, so I can actually run the code
> model<-svyglm(api00~stype, design=dclus2)
> summary(model)
Call:
svyglm(formula = api00 ~ stype, design = dclus2)
Survey design:
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 692.81 30.28 22.878 < 2e-16 ***
stypeH -94.47 27.66 -3.415 0.00156 **
stypeM -50.46 23.01 -2.193 0.03466 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 17528.44)
Number of Fisher Scoring iterations: 2
There are three school types, E
, M
, and H
. The two coefficients here estimate differences between the mean of E
and the means of the other two groups and the $p$-values test the hypotheses that H
and E
have the same mean and that M
and E
have the same mean.
If you want an overall test for the difference in means among the three groups you can use the regTermTest
function, which tests a term or set of terms in the model, eg,
> regTermTest(model,~stype)
Wald test for stype
in svyglm(formula = api00 ~ stype, design = dclus2)
F = 12.5997 on 2 and 37 df: p= 6.7095e-05
That F test is analogous to the one stats::aov
gives. It's not identical, because this is survey data