I'm running an ANOVA on a dataset that looks like this:
ID | cluster | gender | description | country | age |
---|---|---|---|---|---|
124 | 1 | F | SMALL | US | 32 |
324 | 1 | M | MEDIUM | CA | 12 |
82 | 5 | F | SMALL | US | 45 |
453 | 2 | F | LARGE | AU | 34 |
... | ... | ... | ... | ... | ... |
473 | 3 | M | SMALL | DK | 32 |
172 | 4 | M | LARGE | US | 62 |
23 | 4 | F | LARGE | US | 23 |
This dataset is a summary of the data. My data is divided into 5 clusters, and I would like to see how each of the other variables (everything except ID) differs between clusters. ANOVA was recommended to me for this. I got some results, but they didn't quite line up with what I saw graphically. For example, based on my bar graph I expected gender to be statistically significant, but it isn't. That's fine if it's true (obviously there is nothing I can do about it), but I want to check that I'm doing this correctly and to see whether there is a different way to do this other than ANOVA.
Here is what I have been doing:
one.way.gender <- aov(cluster ~ gender, data = data1)
summary(one.way.gender)
one.way.description <- aov(cluster ~ description, data = data1)
summary(one.way.description)
one.way.country <- aov(cluster ~ country, data = data1)
summary(one.way.country)
one.way.age <- aov(cluster ~ age, data = data1)
summary(one.way.age)
I know this is a very simple way of doing this, and I'm worried I'm missing something. I've gone through a few tutorials with lengthier code for the analysis, but they yield the same results as the simple code.
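One thing that may explain the mismatch: in the calls above `cluster` sits on the left of the formula, so `aov` treats the labels 1 to 5 as a numeric response rather than as group labels. For a numeric variable such as age, the usual direction is the reverse. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not the data from the question):

```r
# Toy data, invented for illustration only
set.seed(1)
toy <- data.frame(
  cluster = rep(1:5, each = 10),
  age     = round(runif(50, min = 18, max = 65))
)

# Treat cluster as a grouping factor, not as a number
toy$cluster <- factor(toy$cluster)

# Numeric variable on the left, grouping factor on the right
fit_age <- aov(age ~ cluster, data = toy)
summary(fit_age)
```

With `cluster` as a factor, the cluster row of the summary has 4 degrees of freedom (5 groups minus 1), which is a quick check that it is being treated as a grouping variable.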
library(dplyr)
library(purrr)
# Categorical variables
variables <- c("gender", "description", "country")
# chi-square test
result <- variables %>%
map(~chisq.test(df$cluster, df[[.]]))
result
# output:
[[1]]
Pearson's Chi-squared test
data: df$cluster and df[[.]]
X-squared = 2.9167, df = 4, p-value = 0.5719
[[2]]
Pearson's Chi-squared test
data: df$cluster and df[[.]]
X-squared = 9.3333, df = 8, p-value = 0.315
[[3]]
Pearson's Chi-squared test
data: df$cluster and df[[.]]
X-squared = 16.625, df = 12, p-value = 0.1643
# numeric variables
kruskal_age <- kruskal.test(age ~ cluster, data = df)
print(kruskal_age)
# output
Kruskal-Wallis rank sum test
data: age by cluster
Kruskal-Wallis chi-squared = 2.5909, df = 4, p-value = 0.6284
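The list output above can be hard to scan, and the chi-square approximation can be unreliable when some expected cell counts are small. The following is a rough sketch, on made-up data, of collecting the p-values into one table together with the smallest expected count per test. The `toy` data frame, the `overview` table, and the choice of `B = 2000` are assumptions for illustration, not part of the original analysis:

```r
# Toy data, invented for illustration only
set.seed(42)
toy <- data.frame(
  cluster     = factor(sample(1:5, 200, replace = TRUE)),
  gender      = sample(c("F", "M"), 200, replace = TRUE),
  description = sample(c("SMALL", "MEDIUM", "LARGE"), 200, replace = TRUE),
  country     = sample(c("US", "CA", "AU", "DK"), 200, replace = TRUE)
)

variables <- c("gender", "description", "country")

# One chi-square test per categorical variable, named by variable
tests <- lapply(variables, function(v) chisq.test(table(toy$cluster, toy[[v]])))
names(tests) <- variables

# One row per variable: p-value and smallest expected cell count
overview <- data.frame(
  variable     = variables,
  p_value      = sapply(tests, function(x) x$p.value),
  min_expected = sapply(tests, function(x) min(x$expected))
)
overview

# If expected counts are small (a common rule of thumb is below 5),
# a Monte Carlo p-value is one possible alternative
chisq.test(table(toy$cluster, toy$country), simulate.p.value = TRUE, B = 2000)
```

To try this on the real data, replace `toy` with `df`. Because the toy values are random, the numbers it prints say nothing about the actual clusters.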