Tags: r, categorical-data, anova

ANOVA for categorical data with 4 groups in RStudio


I'm trying to run an ANOVA test on descriptive variables across 4 different groups; the 4 groups are defined by the presence or absence of 2 complications.

My data

structure(list(values = c("F", "F", "M", "F", "F", "M", "F", 
"F", "F", "F", "F", "F", "F", "M", "M", "F", "F", "F", "F", "M"
), ind = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Group 1", 
"Group 2 ", "Group 3", "Group 4"), class = "factor")), row.names = c(NA, 
20L), class = "data.frame")

I tried the code below to run the ANOVA test:

anovaresult= aov(data_new$values ~ data_new$ind, data=data_new)

and I'm getting the error message below:


Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

Many thanks. Please note my df is created by stacking the 4 groups together with the stack() function.


Solution

  • An ANOVA is used when you have a categorical independent variable and you want to test for differences between the means of a normally distributed continuous dependent variable. Your dependent variable is dichotomous (M/F), so ANOVA is not appropriate.
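
    As an aside, the error message itself comes from the response column: aov() expects a numeric dependent variable, and coercing the character values "F"/"M" to numeric produces NAs. A minimal sketch of that coercion, just to illustrate where the warning comes from:

    # Coercing non-numeric strings to numeric yields NA, which is what
    # triggers "NA/NaN/Inf in 'y'" inside aov()/lm.fit()
    as.numeric(c("F", "M"))
    # [1] NA NA
    # Warning message:
    # NAs introduced by coercion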

    Let's say you have categorical data similar to yours, such as this:

    # Simulate a binary (0/1) outcome across 4 groups
    set.seed(123)
    df <- data.frame(result = sample(0:1, 100, replace = TRUE),
                     group = sample(paste("Group", 1:4), 100, replace = TRUE))
    

    Since the outcomes are sampled independently of group membership, we would not expect any difference between the groups. We can test this statistically using a chi-squared test, a popular choice for count data like this. In R this is implemented as:

    # Chi-squared test of independence
    chisq.test(df$result, df$group)
    
    #  Pearson's Chi-squared test
    # 
    # data:  df$result and df$group
    # X-squared = 0.18662, df = 3, p-value = 0.9797
    

    Here you see the p-value is well above the conventional 0.05 threshold, so we would conclude there is no evidence of a difference.
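
    It can also help to inspect the contingency table of counts that the test works from (using the simulated df from above):

    # Counts of 0/1 outcomes within each group; chisq.test() compares
    # these observed counts against the counts expected under independence
    table(df$group, df$result)

    With very small cell counts, fisher.test() on the same table is a common exact alternative.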

    If the outcome were ordinal (e.g., Likert-style data), we could use a nonparametric analog, the Kruskal-Wallis test. In R this is implemented as:

    kruskal.test(df$result, df$group)
     
    #  Kruskal-Wallis rank sum test
    # 
    # data:  df$result and df$group
    # Kruskal-Wallis chi-squared = 0.18475, df = 3, p-value = 0.98
    
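
    Note that kruskal.test() also accepts a formula, which reads a little more naturally; this call is equivalent to the one above:

    # Formula interface; equivalent to kruskal.test(df$result, df$group)
    kruskal.test(result ~ group, data = df)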

    You could also use logistic regression to examine the strength of the association, if any. In R this could be implemented by:

    mdl <- glm(result ~ group, data = df, family = binomial(link = "logit"))
    summary(mdl)
    
    # Call:
    # glm(formula = result ~ group, family = binomial(link = "logit"), 
    #     data = df)
    # 
    # Deviance Residuals: 
    #    Min      1Q  Median      3Q     Max  
    # -1.128  -1.034  -1.034   1.281   1.328  
    # 
    # Coefficients:
    #              Estimate Std. Error z value Pr(>|z|)
    # (Intercept)   -0.1178     0.4859  -0.242    0.808
    # groupGroup 2  -0.2305     0.6150  -0.375    0.708
    # groupGroup 3  -0.2305     0.6150  -0.375    0.708
    # groupGroup 4  -0.1234     0.6312  -0.195    0.845
    # 
    # (Dispersion parameter for binomial family taken to be 1)
    # 
    #     Null deviance: 136.66  on 99  degrees of freedom
    # Residual deviance: 136.48  on 96  degrees of freedom
    # AIC: 144.48
    # 
    # Number of Fisher Scoring iterations: 4
    

    Note that in logistic regression you would typically exponentiate the coefficients and confidence intervals to obtain odds ratios (ORs). In R you could do this by:

    exp(coef(mdl))
    # (Intercept) groupGroup 2 groupGroup 3 groupGroup 4 
    #   0.8888889    0.7941176    0.7941176    0.8839286 
    
    exp(confint(mdl))
    
    #                 2.5 %   97.5 %
    # (Intercept)  0.3337300 2.324904
    # groupGroup 2 0.2352175 2.680856
    # groupGroup 3 0.2352175 2.680856
    # groupGroup 4 0.2537607 3.081575
    

    As you can see, the OR confidence intervals all contain 1 (the null value of no difference), as expected.
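
    If you want the point estimates and intervals side by side, one way to assemble them (a small convenience sketch, reusing the mdl object from above) is:

    # Combine ORs and their profile-likelihood CIs into a single table
    cbind(OR = exp(coef(mdl)), exp(confint(mdl)))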

    These are just some examples of how to implement statistical tests and measures of effect for your type of data; the list is not comprehensive. Good luck!