Tags: python, multivariate-testing, scipy.stats

SciPy normaltest with a multi-column dataset


I have a dataset with 100 rows and 21 columns, where the columns are the variables. I want to know whether these variables came from a multivariate normal distribution. I used the normaltest function from the SciPy library, but I can't understand the results. Here is my code:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.random(2100).reshape(100,21)) # dataset (100x21)
k2, p = stats.normaltest(df)

In this example, p is an array of 21 values, not a single value. Can anybody explain how to interpret this array?


Solution

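  • stats.normaltest works along axis 0 by default, so it tests each column separately and returns one p-value per column; p[x] is the p-value for column x. The null hypothesis is that the values in that column come from a normal distribution. If p[x] < 0.05, you reject that hypothesis at the 5% level: if the data really were normal, a test statistic at least this extreme would occur less than 5% of the time. Conversely, if p[x] > 0.05, you cannot reject normality. This does not prove that the data are normally distributed, only that the test found no evidence against it. You can check this with data drawn from a normal distribution: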

    import numpy as np
    import pandas as pd
    from scipy import stats
    df = pd.DataFrame(np.random.normal(0,1,2100).reshape(100,21)) # dataset (100x21)
    k2, p = stats.normaltest(df)  # one test per column
    print(p)
    

    The output is

    [0.97228661 0.49017509 0.97373345 0.97404468 0.03498392 0.61963074
     0.07712131 0.52632157 0.29887186 0.30822356 0.14416431 0.11015074
     0.81773481 0.52919266 0.81859869 0.24855451 0.16817784 0.0117747
     0.76860707 0.40384319 0.97038048]
    

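    with most of them larger than 0.05. Two of the 21 values (0.03498392 and 0.0117747) fall below 0.05, which is not surprising: at the 5% level, about one false rejection is expected in 21 tests even when every column is normal.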

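    Note that each column being normal on its own does not imply that the columns are jointly multivariate normal. To test multivariate normality, you may try the Henze-Zirkler test: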

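    import pingouin as pg
    # Older pingouin releases return (normal, p); newer ones return a named
    # tuple (hz, pval, normal), in which case unpack three values instead.
    normal, p = pg.multivariate_normality(df, alpha=.05)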
    

    where .05 is the significance level (you may change it if you want; it only affects the normal flag, not the p-value you obtain).
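
To make the per-column reading of the normaltest p-values concrete, here is a minimal sketch that runs the test on a uniform dataset (as in the question) and a normal one (as in the answer) and counts how many columns are rejected. The seed, the dictionary names, and the Bonferroni correction are my own additions for illustration and are not part of the original answer; the correction is one common way to account for running 21 tests at once.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
alpha = 0.05

# Two 100x21 datasets: uniform (question) and standard normal (answer)
samples = {
    "uniform": rng.random((100, 21)),
    "normal": rng.normal(0, 1, (100, 21)),
}

rejections = {}
for name, data in samples.items():
    # axis=0 (the default) tests each column separately: 21 p-values
    k2, p = stats.normaltest(data, axis=0)
    # Columns where normality is rejected at the unadjusted level
    raw = int((p < alpha).sum())
    # Bonferroni: compare each p-value against alpha / number of tests
    adjusted = int((p < alpha / p.size).sum())
    rejections[name] = (raw, adjusted)
    print(f"{name}: {raw} of {p.size} columns rejected at {alpha}, "
          f"{adjusted} after Bonferroni correction")
```

With 100 rows per column, the uniform data should be rejected in most or all columns, while the normal data should show at most a handful of unadjusted rejections. Neither count says anything about joint (multivariate) normality.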