Search code examples
rstatisticsanovasurvey

One-way anova using the Survey package in R


I am trying to identify the best way to run a one-way Anova on a complex survey design. After perusing Lumley's Survey package documentation, I am none the wiser.

The survey::anova function is meant to 'Fit and compare hierarchical loglinear models for complex survey data', which is not what I am doing.

What I am trying to do I have collected data about one categorical independent variable [3 levels] and one quantitative dependent variable. I want to use ANOVA to check if the dependent variable changes according to the level of the independent variable.

Here is an example of my process:

Load Survey package and create complex survey design object

library(survey)

df <- data.frame(sex = c('F', 'O', NA, 'M', 'M', 'O', 'F', 'F'),
                 married = c(1,1,1,1,0,0,1,1),
                 pens = c(0, 1, 1, NA, 1, 1, 0, 0),
                 weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))

svy_design <- svydesign(ids=~1, data=df, weights=~weight)

Borrowing from this post over here,

Method 1: using survey::aov

summary(aov(weight~sex,data = svy_design))

However I got an error saying:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'summary': object 'api00' not foun

Method 2: use survey::glm instead of anova

That same post has an answer/explanation with a case against using anova:

According to the main statistician of our institute there is no easy implementation of this kind of analysis in any common modeling environment. The reason for that is that ANOVA and ANCOVA are linear models that where not further developed after the emergence of General Linear Models (later Generalized linear models - GLMs) in the 70's. A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.

summary(svyglm(weight~sex,svy_design))

I got this output:

call:
svyglm(formula = weight ~ sex, design = svy_design)

Survey design:
svydesign(ids = ~1, data = df, weights = ~weight)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.8730     0.1478   5.905  0.00412 **
sexM         -0.3756     0.1855  -2.024  0.11292   
sexO         -0.4174     0.1788  -2.334  0.07989 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.04270091)

Number of Fisher Scoring iterations: 2

My Questions:

  1. Why does method 1 throw an error?
  2. Is it possible to use the survey::aov function accomplish my goal?
  3. If I were to use survey::glm [method 2], which value should I be looking at to identify a difference in means? Would it be the p value of the intercept?

I am a far cry from a stats buff, please do explain in the simplest possible terms. Thank you!!


Solution

  • There is no such function as survey::aov, so you can't use it to accomplish your goal. Your code uses stats::aov

    You can use survey::svyglm. I will use one of the examples from the package, so I can actually run the code

    > model<-svyglm(api00~stype, design=dclus2)
    > summary(model)
    
    Call:
    svyglm(formula = api00 ~ stype, design = dclus2)
    
    Survey design:
    dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   692.81      30.28  22.878  < 2e-16 ***
    stypeH        -94.47      27.66  -3.415  0.00156 ** 
    stypeM        -50.46      23.01  -2.193  0.03466 *  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for gaussian family taken to be 17528.44)
    
    Number of Fisher Scoring iterations: 2
    

    There are three school types, E, M, and H. The two coefficients here estimate differences between the mean of E and the means of the other two groups and the $p$-values test the hypotheses that H and E have the same mean and that M and E have the same mean.

    If you want an overall test for the difference in means among the three groups you can use the regTermTest function, which tests a term or set of terms in the model, eg,

    > regTermTest(model,~stype)
    Wald test for stype
     in svyglm(formula = api00 ~ stype, design = dclus2)
    F =  12.5997  on  2  and  37  df: p= 6.7095e-05 
    

    That F test is analogous to the one stats::aov gives. It's not identical, because this is survey data