Questions about ANOVA analysis in R of groups

I'm doing ANOVA analysis on a dataset that looks like:

ID cluster gender description country age
124 1 F SMALL US 32
324 1 M MEDIUM CA 12
82 5 F SMALL US 45
453 2 F LARGE AU 34
... ... ... ... ... ...
473 3 M SMALL DK 32
172 4 M LARGE US 62
23 4 F LARGE US 23

This is a summary of the data. My data is divided into 5 clusters, and I would like to see how each of the other variables, except ID, differs between clusters. ANOVA was recommended to me for this. I got some results, but they didn't quite line up with what I saw graphically. For example, based on my bar graph I expected gender to be statistically significant, but it's not. That's fine if it's true (obviously there is nothing I can do about it), but I want to check that I'm doing this correctly and to see whether there is an approach other than ANOVA.

Here is what I have been doing:

one.way.gender <- aov(cluster ~ gender, data = data1)

one.way.description <- aov(cluster ~ description, data = data1)
summary(one.way.description)

one.way.country <- aov(cluster ~ country, data = data1)

one.way.age <- aov(cluster ~ age, data = data1)

I know this is a very simple way of doing this, and I'm worried I'm missing something. I've gone through a few tutorials with lengthier code for the analysis, but they yield the same results as the simple code.


  • For categorical data, use a chi-square test:

    library(purrr)  # provides map() and re-exports %>%

    # Categorical variables
    variables <- c("gender", "description", "country")
    # Chi-square test of cluster against each categorical variable
    result <- variables %>%
      map(~chisq.test(df$cluster, df[[.]]))
    # output: 
        Pearson's Chi-squared test
    data:  df$cluster and df[[.]]
    X-squared = 2.9167, df = 4, p-value = 0.5719
        Pearson's Chi-squared test
    data:  df$cluster and df[[.]]
    X-squared = 9.3333, df = 8, p-value = 0.315
        Pearson's Chi-squared test
    data:  df$cluster and df[[.]]
    X-squared = 16.625, df = 12, p-value = 0.1643

  • For numeric data, use the Kruskal-Wallis test:

    # numeric variables
    kruskal_age <- kruskal.test(age ~ cluster, data = df)
    # output
        Kruskal-Wallis rank sum test
    data:  age by cluster
    Kruskal-Wallis chi-squared = 2.5909, df = 4, p-value = 0.6284
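
To try both tests end to end, here is a minimal base-R sketch. The data frame is made-up toy data in the same shape as the table in the question (not the asker's data), and names such as `chisq_results` and `chisq_p` are only illustrative. Note that `cluster` is stored as a factor and used as the grouping variable. In `aov(cluster ~ gender)` the cluster label is instead treated as a numeric response, which is probably not what is intended.

```r
# Hypothetical toy data shaped like the question's table (values are random).
set.seed(1)
n <- 200
df <- data.frame(
  cluster     = factor(sample(1:5, n, replace = TRUE)),
  gender      = sample(c("F", "M"), n, replace = TRUE),
  description = sample(c("SMALL", "MEDIUM", "LARGE"), n, replace = TRUE),
  country     = sample(c("US", "CA", "AU", "DK"), n, replace = TRUE),
  age         = sample(12:65, n, replace = TRUE)
)

# Categorical variables: chi-square test of independence against cluster.
variables <- c("gender", "description", "country")
chisq_results <- lapply(variables, function(v) {
  chisq.test(table(df$cluster, df[[v]]))
})
names(chisq_results) <- variables

# Collect one p-value per categorical variable into a named vector.
chisq_p <- sapply(chisq_results, function(res) res$p.value)

# Numeric variable: Kruskal-Wallis test with cluster as the grouping factor.
kruskal_age <- kruskal.test(age ~ cluster, data = df)

print(chisq_p)
print(kruskal_age$p.value)
```

If some cluster-by-category cells have very small counts, `chisq.test()` warns that the approximation may be incorrect. In that case `chisq.test(..., simulate.p.value = TRUE)` or `fisher.test()` may be worth considering.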