Tags: r, categorical-data, anova

ANOVA for categorical data with 4 groups in RStudio


I'm trying to run an ANOVA test on descriptive variables across 4 different groups; the 4 groups are defined by the presence or absence of 2 complications.

My data

structure(list(values = c("F", "F", "M", "F", "F", "M", "F", 
"F", "F", "F", "F", "F", "F", "M", "M", "F", "F", "F", "F", "M"
), ind = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Group 1", 
"Group 2 ", "Group 3", "Group 4"), class = "factor")), row.names = c(NA, 
20L), class = "data.frame")

I tried the code below to run the ANOVA test:

anovaresult= aov(data_new$values ~ data_new$ind, data=data_new)

and I'm getting the error message below:


Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

Many thanks. Please note my df is created by stacking the 4 groups together with the stack() function.


Solution

  • An ANOVA is used when you have a categorical independent variable and you want to test for differences between the means of a normally distributed continuous dependent variable. Your dependent variable is dichotomous (M/F), so ANOVA is not appropriate.
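
    As an aside, the error message itself comes from the response column: aov() expects a numeric dependent variable, and coercing the character values "F"/"M" to numeric produces NAs. A minimal sketch of that coercion, just to illustrate where the warning comes from:

    # Coercing non-numeric strings to numeric yields NA, which is what
    # triggers "NA/NaN/Inf in 'y'" inside aov()/lm.fit()
    as.numeric(c("F", "M"))
    # [1] NA NA
    # Warning message:
    # NAs introduced by coercion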

    Let's say you have categorical data similar to yours, such as this:

    # Simulate a binary (0/1) outcome across 4 groups
    set.seed(123)
    df <- data.frame(result = sample(0:1, 100, replace = TRUE),
                     group = sample(paste("Group", 1:4), 100, replace = TRUE))
    

    Since the outcomes are sampled independently of group membership, we would not expect any difference between the groups. We can test this statistically using a chi-squared test, a popular choice for count data like this. In R this is implemented as:

    # Chi-squared test of independence
    chisq.test(df$result, df$group)
    
    #  Pearson's Chi-squared test
    # 
    # data:  df$result and df$group
    # X-squared = 0.18662, df = 3, p-value = 0.9797
    

    Here you see the p-value is well above the conventional 0.05 threshold, so we would conclude there is no evidence of a difference.
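
    It can also help to inspect the contingency table of counts that the test works from (using the simulated df from above):

    # Counts of 0/1 outcomes within each group; chisq.test() compares
    # these observed counts against the counts expected under independence
    table(df$group, df$result)

    With very small cell counts, fisher.test() on the same table is a common exact alternative.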

    If the outcome were ordinal (e.g., Likert-style data), we could use a nonparametric analog, the Kruskal-Wallis test. In R this is implemented as:

    kruskal.test(df$result, df$group)
     
    #  Kruskal-Wallis rank sum test
    # 
    # data:  df$result and df$group
    # Kruskal-Wallis chi-squared = 0.18475, df = 3, p-value = 0.98
    
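
    Note that kruskal.test() also accepts a formula, which reads a little more naturally; this call is equivalent to the one above:

    # Formula interface; equivalent to kruskal.test(df$result, df$group)
    kruskal.test(result ~ group, data = df)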

    You could also use logistic regression to examine the strength of the association, if any. In R this could be implemented by:

    mdl <- glm(result ~ group, data = df, family = binomial(link = "logit"))
    summary(mdl)
    
    # Call:
    # glm(formula = result ~ group, family = binomial(link = "logit"), 
    #     data = df)
    # 
    # Deviance Residuals: 
    #    Min      1Q  Median      3Q     Max  
    # -1.128  -1.034  -1.034   1.281   1.328  
    # 
    # Coefficients:
    #              Estimate Std. Error z value Pr(>|z|)
    # (Intercept)   -0.1178     0.4859  -0.242    0.808
    # groupGroup 2  -0.2305     0.6150  -0.375    0.708
    # groupGroup 3  -0.2305     0.6150  -0.375    0.708
    # groupGroup 4  -0.1234     0.6312  -0.195    0.845
    # 
    # (Dispersion parameter for binomial family taken to be 1)
    # 
    #     Null deviance: 136.66  on 99  degrees of freedom
    # Residual deviance: 136.48  on 96  degrees of freedom
    # AIC: 144.48
    # 
    # Number of Fisher Scoring iterations: 4
    

    Note that in logistic regression you would typically exponentiate the coefficients and confidence intervals to obtain odds ratios (ORs). In R you could do this by:

    exp(coef(mdl))
    # (Intercept) groupGroup 2 groupGroup 3 groupGroup 4 
    #   0.8888889    0.7941176    0.7941176    0.8839286 
    
    exp(confint(mdl))
    
    #                 2.5 %   97.5 %
    # (Intercept)  0.3337300 2.324904
    # groupGroup 2 0.2352175 2.680856
    # groupGroup 3 0.2352175 2.680856
    # groupGroup 4 0.2537607 3.081575
    

    As you can see, the OR confidence intervals all contain 1 (the null value of no difference), as expected.
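
    If you want the point estimates and intervals side by side, one way to assemble them (a small convenience sketch, reusing the mdl object from above) is:

    # Combine ORs and their profile-likelihood CIs into a single table
    cbind(OR = exp(coef(mdl)), exp(confint(mdl)))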

    These are just some examples of how to implement statistical tests and measures of effect for your type of data; the list is not comprehensive. Good luck!