Using Spark, I aggregated the data for each group (cohort) so that it contains only the mean, standard deviation, and variance.
In a second step, using Python, I would like to test for normality (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) and afterward for significance, using either the t-test (stats.ttest_ind) or the Wilcoxon rank test (stats.wilcoxon).
However, all these methods expect the data to be fed in as raw record-oriented values. How can I use them with the pre-aggregated data?
Mean, standard deviation, and variance are not enough to test for normality in each cohort. The standard deviation is just the square root of the variance, so you effectively have only two independent statistics, and neither describes the shape of the distribution.
You could also (or instead) compute the skewness and kurtosis of each cohort and save the number of observations. The Jarque–Bera test is a normality test that depends only on the skewness, the kurtosis, and the number of observations.
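As a minimal sketch of how the Jarque–Bera statistic could be computed from those three summaries alone: the function name `jarque_bera_from_moments` and the example numbers are hypothetical, and the code assumes the stored skewness and kurtosis are the population (biased) versions, with kurtosis given as excess kurtosis (normal = 0). Check which convention your aggregation uses, and subtract 3 first if it returns plain kurtosis. The statistic is asymptotically chi-squared with 2 degrees of freedom, whose survival function is exp(-x/2), so no raw data is needed.

```python
import math


def jarque_bera_from_moments(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic and asymptotic p-value from summary statistics.

    Assumes population (biased) skewness and excess kurtosis (normal = 0).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    # JB = n/6 * (S^2 + K_excess^2 / 4)
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Survival function of a chi-squared distribution with 2 degrees
    # of freedom; equivalent to scipy.stats.chi2.sf(jb, df=2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical summary for one cohort (illustrative numbers only).
jb, p = jarque_bera_from_moments(n=500, skewness=0.1, excess_kurtosis=-0.2)
print(f"JB = {jb:.4f}, p = {p:.4f}")
```

The chi-squared approximation is only reliable for reasonably large cohorts. For the significance step, scipy.stats.ttest_ind_from_stats accepts the means, standard deviations, and counts of two groups directly, whereas a rank-based test such as Wilcoxon cannot be computed from these summaries because it needs the individual observations.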