statistics, anova, t-test

Parametric or Non-parametric group test for 5 different groups


Problem Statement - Statistically show whether 5 groups are the same or different

  • I am working on a problem with a dataset size of ~600,000.

  • There are 5 groups, say [A,B,C,D,E], each with corresponding salaries and ~100k observations per group.

df['Salary'] is slightly right-skewed. I tried ANOVA and the Kruskal-Wallis test.

ANOVA Results

If I use all data - The p value indicates that groups are statistically different (p

If I use 10K random samples within each group, the p-value increases to ~0.002333

If I use 1000 random samples within each group, the p-value exceeds 0.05 and is of the order of ~0.5

I am not sure how to evaluate these results. What sample size should I use, and what other methods should I consider?

Mean and SD of the 5 groups are below (when I consider 100,000 random samples for each group):

Group 1 - (12.134831460674159, 5.1823701530849995)

Group 2 - (11.64860907759883, 5.092876703946831)

Group 3 - (11.660195118395315, 4.952100116921575)

Group 4 - (12.052747507535358, 5.091383288751849)

Group 5 - (11.468062169943916, 4.996349965883181)

Kruskal Results

When sample size = 100

KruskalResult(statistic=34.20564125753886, pvalue=6.762162830091762e-07)

When sample size = 10,000

KruskalResult(statistic=179.39353155924363, pvalue=1.0064249109632168e-37)

[Figure: Distribution of Avg salary - Total population of ~600k]


Solution

  • You have a huge sample size, 100k for each group. With this many data points, even a very small difference between groups is almost guaranteed to come out as statistically significant. These tests were not really designed with such large sample sizes in mind.

    You should use all your data to get the best possible estimates, but you will have to use domain knowledge to decide whether the difference is practically significant. You should also look at the confidence intervals for the group differences to judge the size of the effect.

    Also, ANOVA assumes that the residuals are normally distributed, not the data itself.
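To illustrate the point about statistical versus practical significance, here is a minimal sketch in Python. It does not use the real salary data: the five groups are simulated from gamma distributions (an assumption, chosen only because they are right-skewed) with means and SDs rounded from the figures in the question. The helper names `simulate`, `eta_squared` and `mean_diff_ci` are made up for this example. It runs a one-way ANOVA at several sample sizes, reports eta squared as an effect size, and computes a normal-approximation confidence interval for one pairwise difference in means.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data: right-skewed gamma groups whose means and SDs
# are rounded from the values reported in the question (not the real data).
rng = np.random.default_rng(0)
params = {"A": (12.13, 5.18), "B": (11.65, 5.09), "C": (11.66, 4.95),
          "D": (12.05, 5.09), "E": (11.47, 5.00)}

def simulate(mean, sd, n):
    """Draw n gamma-distributed values with the given mean and SD."""
    shape = (mean / sd) ** 2   # gamma shape = mean^2 / variance
    scale = sd ** 2 / mean     # gamma scale = variance / mean
    return rng.gamma(shape, scale, size=n)

def eta_squared(groups):
    """Share of total variance explained by group membership."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation CI for mean(a) - mean(b), unequal variances."""
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = stats.norm.ppf(0.5 + level / 2)
    return diff - z * se, diff + z * se

full = {name: simulate(m, s, 100_000) for name, (m, s) in params.items()}

# Same test, growing sample size: compare the p-value with the effect size.
for n in (1_000, 10_000, 100_000):
    groups = [g[:n] for g in full.values()]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"n={n:>7}: p={p_value:.3g}, eta^2={eta_squared(groups):.4f}")

low, high = mean_diff_ci(full["A"], full["E"])
print(f"95% CI for mean(A) - mean(E): [{low:.3f}, {high:.3f}]")
```

In a setup like this the p-value tends to shrink as n grows, while eta squared stays small throughout, because the group means differ by well under one salary unit against an SD of about 5. Whether a difference of that size matters is the domain question; the confidence interval tells you how precisely it has been estimated, not whether it is important.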