r statistics data-analysis anova

Questions about ANOVA analysis in R of groups


I'm doing ANOVA analysis on a dataset that looks like:

ID   cluster  gender  description  country  age
124  1        F       SMALL        US       32
324  1        M       MEDIUM       CA       12
82   5        F       SMALL        US       45
453  2        F       LARGE        AU       34
...  ...      ...     ...          ...      ...
473  3        M       SMALL        DK       32
172  4        M       LARGE        US       62
23   4        F       LARGE        US       23

This dataset is a summary of the data. My data is divided into 5 clusters, and I would like to see how the other variables (all except ID) differ between clusters. ANOVA was recommended to me for this. I got some results, but they didn't quite line up with what I saw graphically. For example, based on my bar graph I expected gender to be statistically significant, but it is not. That is fine if it is true, but I want to check that I am doing this correctly and to find out whether there is a better way to do this than ANOVA.

Here is what I have been doing:

one.way.gender <- aov(cluster ~ gender, data = data1)
summary(one.way.gender)

one.way.description <- aov(cluster ~ description, data = data1)
summary(one.way.description)

one.way.country <- aov(cluster ~ country, data = data1)
summary(one.way.country)

one.way.age <- aov(cluster ~ age, data = data1)
summary(one.way.age)

I know this is a very simple way of doing this, and I'm worried I'm missing something. I've gone through a few tutorials with lengthier code for the analysis, but they yield the same results as the simple code.


Solution

  • For categorical data, use a chi-square test:

    library(dplyr)
    library(purrr)
    
    # Categorical variables to test against cluster
    # (df is the data frame, called data1 in the question)
    variables <- c("gender", "description", "country")
    
    # chi-square test 
    result <- variables %>% 
      map(~chisq.test(df$cluster, df[[.]]))
    
    result
    
    # output: 
    
    [[1]]
    
        Pearson's Chi-squared test
    
    data:  df$cluster and df[[.]]
    X-squared = 2.9167, df = 4, p-value = 0.5719
    
    
    [[2]]
    
        Pearson's Chi-squared test
    
    data:  df$cluster and df[[.]]
    X-squared = 9.3333, df = 8, p-value = 0.315
    
    
    [[3]]
    
        Pearson's Chi-squared test
    
    data:  df$cluster and df[[.]]
    X-squared = 16.625, df = 12, p-value = 0.1643
    

  • For numeric data, use a Kruskal-Wallis test:

    # numeric variables
    kruskal_age <- kruskal.test(age ~ cluster, data = df)
    print(kruskal_age)
    
    # output
        Kruskal-Wallis rank sum test
    
    data:  age by cluster
    Kruskal-Wallis chi-squared = 2.5909, df = 4, p-value = 0.6284
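
One likely reason the original `aov(cluster ~ gender)` results looked off: if `cluster` is stored as an integer (an assumption about the asker's data), that call treats the cluster number as a numeric response and compares mean cluster labels between genders, which means little for arbitrary labels. The minimal sketch below shows the direction the tests above use. It cross-tabulates a categorical variable against `cluster`, and for `age` it puts the numeric variable on the left-hand side with `cluster` as a factor. The `toy` data frame and its values are invented purely for illustration.

```r
# Hypothetical toy data with some of the question's columns
# (values are invented for illustration only).
set.seed(42)
n <- 200
toy <- data.frame(
  cluster = sample(1:5, n, replace = TRUE),
  gender  = sample(c("F", "M"), n, replace = TRUE),
  age     = round(runif(n, 18, 70))
)

# Cluster labels are categories, not quantities, so store them as a factor.
toy$cluster <- factor(toy$cluster)

# Categorical vs. cluster: cross-tabulate, then test independence.
tab <- table(toy$cluster, toy$gender)
chi <- chisq.test(tab)

# Small expected counts (a common rule of thumb is below 5) make the
# chi-square approximation unreliable; fisher.test() or
# chisq.test(tab, simulate.p.value = TRUE) are possible alternatives.
min_expected <- min(chi$expected)

# Numeric vs. cluster: the numeric variable is the response and the
# cluster factor is the grouping variable.
anova_age   <- aov(age ~ cluster, data = toy)
kruskal_age <- kruskal.test(age ~ cluster, data = toy)

print(tab)
print(chi)
print(summary(anova_age))
print(kruskal_age)
```

If the ANOVA assumptions (roughly normal residuals, similar variances across clusters) look doubtful, the Kruskal-Wallis result is the safer one to report. Looking at `tab` directly also helps reconcile the test with what a bar chart appears to show.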