I have a dataset with 100 rows and 21 columns, where the columns are the variables. I want to know whether these variables came from a multivariate normal distribution. I used the normaltest function from the SciPy library, but I can't understand the results. Here is my code:
import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.random(2100).reshape(100,21)) # dataset (100x21)
k2, p = stats.normaltest(df)
In this example, p is an array of 21 values, not a single value. Can anybody explain how to interpret this array?
If p[x] < 0.05, you can conclude that the values in column x are unlikely to be normally distributed. The null hypothesis of a normality test is that the sample comes from a normally distributed population. A p-value below 0.05 means that, if the null hypothesis were true, a test statistic at least this extreme would occur less than 5% of the time, so the null hypothesis is rejected at the 5% level.
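Conversely, if p[x] > 0.05, you cannot reject the hypothesis that the data in column x are normally distributed (this is not proof that they are, only a lack of evidence against it). You can easily check this with samples drawn from a normal distribution: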
import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.normal(0,1,2100).reshape(100,21)) # dataset (100x21)
k2, p = stats.normaltest(df) # one test per column
print(p)
The output is
[0.97228661 0.49017509 0.97373345 0.97404468 0.03498392 0.61963074
0.07712131 0.52632157 0.29887186 0.30822356 0.14416431 0.11015074
0.81773481 0.52919266 0.81859869 0.24855451 0.16817784 0.0117747
0.76860707 0.40384319 0.97038048]
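with most of them larger than 0.05. Two values fall below 0.05 (0.03498392 and 0.0117747), which is not surprising: with 21 tests at the 5% level, about one false rejection is expected even when every column really is normal.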
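To make the per-column reading explicit, here is a minimal sketch that tabulates the statistic and p-value for each column and flags the ones below the threshold. The seed, the name summary, and the Bonferroni-adjusted threshold are illustrative choices for this sketch; the correction is an optional extra for running 21 tests at once, not something normaltest applies for you.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(rng.normal(0, 1, size=(100, 21)))  # dataset (100x21)

# normaltest works along axis 0 by default, so each column gets its own test
k2, p = stats.normaltest(df.to_numpy())

alpha = 0.05
summary = pd.DataFrame(
    {"k2": k2, "p": p, "reject_normality": p < alpha},
    index=df.columns,
)
print(summary)

# Optional: Bonferroni correction, dividing alpha by the number of tests
reject_bonferroni = p < alpha / len(p)
print("Rejected at 0.05:", int(summary["reject_normality"].sum()))
print("Rejected after Bonferroni:", int(reject_bonferroni.sum()))
```

The Bonferroni threshold is stricter than the plain 0.05 cut-off, so it can never flag more columns than the uncorrected comparison does.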
Note that normaltest only checks each column on its own. To test multivariate normality, you can try the Henze-Zirkler test, which is implemented in pingouin:
import pingouin as pg
# Recent pingouin versions return (hz, pval, normal); older releases returned only (normal, p)
hz, p, normal = pg.multivariate_normality(df, alpha=.05)
where .05 is the significance level. You may change it if you want: it only affects the True/False decision returned in normal, not the p-value you obtain.
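A joint test matters because normal columns do not guarantee a multivariate normal distribution. The sketch below uses a constructed example (not the data from the question): y is x multiplied by an independent random sign, so both x and y are standard normal on their own, yet the pair cannot be bivariate normal because x + y is exactly zero about half the time. The variable names and the seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 1000

x = rng.normal(0, 1, n)
s = rng.choice([-1.0, 1.0], size=n)  # random sign, independent of x
y = s * x                            # still standard normal, by symmetry

# Each margin is tested on its own, as normaltest does per column
_, p_x = stats.normaltest(x)
_, p_y = stats.normaltest(y)
print("p-value for x:", p_x)
print("p-value for y:", p_y)

# If (x, y) were jointly normal, x + y would be normal too.
# Here it is exactly 0 whenever s == -1, so it is far from normal.
total = x + y
frac_zero = np.mean(total == 0)
_, p_sum = stats.normaltest(total)
print("fraction of exact zeros in x + y:", frac_zero)
print("p-value for x + y:", p_sum)
```

Per-column p-values alone would not reveal this kind of dependence, which is why a multivariate test such as Henze-Zirkler is the better fit for the original question.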