python scipy statistics aggregate statistical-test

Normality test with pre-aggregated data

Using Spark, I aggregated the data for each group (cohort) so that it contains only the mean, standard deviation, and variance.

Now, in a second step, I would like to use Python to test for normality with scipy.stats.normaltest (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) and afterwards test for significance using either the t-test (stats.ttest_ind) or the Wilcoxon signed-rank test (stats.wilcoxon).

However, all of these functions expect the raw, record-level values as input. How can I use them with the pre-aggregated data?

Solution

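The mean, standard deviation, and variance are not enough to test for normality in each cohort. The standard deviation is just the square root of the variance, so you effectively have only two independent statistics, and they describe location and spread rather than the shape of the distribution.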
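As a rough illustration of why two statistics cannot settle the question, the sketch below builds two samples with the same mean and standard deviation but very different shapes. The sample sizes, the seed, and the helper `standardize` are made up for this example and are not part of the original question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def standardize(x):
    """Rescale a sample to mean 0 and (population) standard deviation 1."""
    return (x - x.mean()) / x.std()


# Two samples forced to share the same mean and standard deviation.
normal_sample = standardize(rng.normal(size=10_000))
skewed_sample = standardize(rng.exponential(size=10_000))

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # Mean and std agree; only the skewness reveals the difference in shape.
    print(name, sample.mean(), sample.std(), stats.skew(sample))
```

Both rows report practically identical mean and standard deviation, so any test that sees only those two numbers could not tell the cohorts apart.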

You could also (or instead) compute the skewness and kurtosis in the aggregation step and store the number of observations per cohort. The Jarque–Bera test is a normality test that depends only on the skewness, the kurtosis, and the number of observations.
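A minimal sketch of how the Jarque–Bera statistic could be computed from those three aggregates is shown here. The function name `jarque_bera_from_stats` and its `excess` flag are hypothetical helpers, not part of SciPy. It assumes the skewness and kurtosis are the biased (population) sample moments, which is what `scipy.stats.jarque_bera` uses internally; check which convention your aggregation step produces, in particular whether the kurtosis is reported as excess kurtosis (normal = 0) or plain kurtosis (normal = 3).

```python
from scipy import stats


def jarque_bera_from_stats(n, skewness, kurtosis, excess=True):
    """Jarque-Bera test from pre-aggregated moments (hypothetical helper).

    n        -- number of observations in the cohort
    skewness -- biased sample skewness
    kurtosis -- biased sample kurtosis
    excess   -- True if `kurtosis` is excess kurtosis (normal = 0),
                False if it is plain kurtosis (normal = 3)
    """
    excess_kurtosis = kurtosis if excess else kurtosis - 3.0
    # JB = n/6 * (S^2 + K_excess^2 / 4)
    statistic = n / 6.0 * (skewness**2 + excess_kurtosis**2 / 4.0)
    # Under normality, JB is asymptotically chi-squared with 2 degrees of freedom.
    p_value = stats.chi2.sf(statistic, df=2)
    return statistic, p_value


# Example with made-up aggregates for one cohort.
statistic, p_value = jarque_bera_from_stats(n=600, skewness=0.1, kurtosis=0.2)
print(statistic, p_value)
```

The chi-squared approximation is asymptotic, so the p-value is most trustworthy for cohorts with many observations.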