Problem Statement - Statistically test whether 5 groups are the same or different
I am working on a problem with a dataset of size ~600,000.
There are 5 groups, say [A, B, C, D, E], with corresponding salaries and ~100k observations per group.
df['Salary'] is slightly right-skewed. I tried ANOVA and the Kruskal-Wallis test.
ANOVA Results
If I use all data - The p value indicates that groups are statistically different (p
If I use 10K random samples within each group, the p-value increases to ~0.002333
If I use 1000 random samples within each group, the p-value exceeds 0.05 and is of the order of ~0.5
I am not sure how to evaluate these results. What sample size should I use, and what other methods should I consider?
Mean and SD of the 5 groups are below (when I consider a random sample of 100,000 for each group):
Group 1 - (12.134831460674159, 5.1823701530849995)
Group 2 - (11.64860907759883, 5.092876703946831)
Group 3 - (11.660195118395315, 4.952100116921575)
Group 4 - (12.052747507535358, 5.091383288751849)
Group 5 - (11.468062169943916, 4.996349965883181)
KRUSKAL RESULTS
When sample size = 100
KruskalResult(statistic=34.20564125753886, pvalue=6.762162830091762e-07)
When sample size = 10,000
KruskalResult(statistic=179.39353155924363, pvalue=1.0064249109632168e-37)
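You have a huge sample size: 100k for each group. With this many data points you are almost guaranteed to find a statistically significant result, because the tests gain enough power to detect even very small differences between groups. These tests were not really designed with such big sample sizes in mind, so a tiny p-value here tells you little about how large the difference actually is.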
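To see how the p-value depends on sample size while the true difference stays fixed, here is a minimal simulation sketch. It is not your data: the group means are rounded from the summaries in the question, and the shared SD of 5.0, the normal distribution, and the helper name `anova_p_by_sample_size` are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Assumed values, rounded from the group summaries in the question.
GROUP_MEANS = [12.13, 11.65, 11.66, 12.05, 11.47]
COMMON_SD = 5.0  # assumption: one shared SD and normally distributed data


def anova_p_by_sample_size(sizes, seed=0):
    """Return {n_per_group: one-way ANOVA p-value} for simulated groups."""
    rng = np.random.default_rng(seed)
    p_values = {}
    for n in sizes:
        # The true means never change; only the number of draws does.
        groups = [rng.normal(mean, COMMON_SD, size=n) for mean in GROUP_MEANS]
        p_values[n] = stats.f_oneway(*groups).pvalue
    return p_values


p_values = anova_p_by_sample_size([100, 1_000, 10_000, 100_000])
for n, p in p_values.items():
    print(f"n per group = {n:>7,}: p = {p:.3g}")
```

The underlying differences are identical in every run, so any change in the p-value comes from the sample size alone. That is why subsampling until the result stops being significant does not answer the question of whether the groups differ in a way that matters.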
You should use all your data to get the best possible estimates; however, you will have to use domain knowledge to decide whether the difference is practically significant. You should also look at the confidence intervals to judge the size of the effect.
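As a rough sketch of what "size of the effect" could look like here, the snippet below works only from the means and SDs posted in the question. It assumes exactly 100,000 observations in every group, that the posted SDs are sample SDs, and a normal approximation (z = 1.96) for the interval; the names `diff_ci` and `cohens_d` are just illustrative helpers.

```python
import math

# (mean, SD) per group, copied from the question.
GROUPS = {
    1: (12.134831460674159, 5.1823701530849995),
    2: (11.64860907759883, 5.092876703946831),
    3: (11.660195118395315, 4.952100116921575),
    4: (12.052747507535358, 5.091383288751849),
    5: (11.468062169943916, 4.996349965883181),
}
N = 100_000  # assumption: the same number of observations in every group

means = [mean for mean, _ in GROUPS.values()]
grand_mean = sum(means) / len(means)  # valid because group sizes are equal

# Eta squared: share of total variance explained by group membership.
ss_between = N * sum((mean - grand_mean) ** 2 for mean in means)
ss_within = (N - 1) * sum(sd ** 2 for _, sd in GROUPS.values())
eta_sq = ss_between / (ss_between + ss_within)


def diff_ci(a, b, z=1.96):
    """Approximate 95% CI for mean(a) - mean(b), normal approximation."""
    (m1, s1), (m2, s2) = GROUPS[a], GROUPS[b]
    se = math.sqrt(s1 ** 2 / N + s2 ** 2 / N)
    return (m1 - m2) - z * se, (m1 - m2) + z * se


def cohens_d(a, b):
    """Standardized mean difference, pooled SD for equal group sizes."""
    (m1, s1), (m2, s2) = GROUPS[a], GROUPS[b]
    return (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2)


ci_low, ci_high = diff_ci(1, 5)  # the two groups with the most distant means
print(f"eta squared = {eta_sq:.4f}")
print(f"Group 1 - Group 5: 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"Cohen's d (1 vs 5) = {cohens_d(1, 5):.3f}")
```

With these inputs, group membership explains well under 1% of the variance in salary, and the interval for the largest pairwise difference is narrow. Whether a gap of that size matters is the domain-knowledge question, not a statistical one.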
Also, ANOVA assumes that the residuals are normally distributed, not the raw data.
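In a one-way ANOVA the residuals are simply each observation minus its own group mean, so that is what you would inspect. A minimal sketch follows; `anova_residuals` is a hypothetical helper, and the gamma-distributed values are a synthetic stand-in for your right-skewed salaries, not your data.

```python
import numpy as np
from scipy import stats


def anova_residuals(values, labels):
    """Subtract each observation's own group mean from it."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    residuals = np.empty_like(values)
    for label in np.unique(labels):
        mask = labels == label
        residuals[mask] = values[mask] - values[mask].mean()
    return residuals


# Synthetic stand-in (assumption): right-skewed gamma draws, 5 groups.
rng = np.random.default_rng(0)
labels = np.repeat(["A", "B", "C", "D", "E"], 2_000)
values = rng.gamma(shape=5.5, scale=2.2, size=labels.size)

res = anova_residuals(values, labels)
# Skewness and excess kurtosis near 0 suggest roughly normal residuals.
print("skewness:", stats.skew(res))
print("excess kurtosis:", stats.kurtosis(res))
```

On your data the call would look something like `anova_residuals(df['Salary'], df['Group'])`, where `'Group'` is a guess at your column name. With samples this large, a formal normality test will reject for the same reason discussed above, so descriptive measures or a Q-Q plot are usually more informative.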