Tags: python, pandas, matplotlib, graph, data-analysis

How do we apply the Central Limit Theorem using Python?


I have a large dataset with 271116 rows. I normalized the data using z-score normalization, but I have no way of knowing whether the data actually follows a normal distribution. So I plotted a simple density graph using matplotlib:

import matplotlib.pyplot as plt
# Kernel density estimate of the Height column
hdf = df['Height'].plot(kind='kde', stacked=False)
plt.show()

I got this for a result:

[image: density plot of the Height column]

The data seems somewhat normal, but can I apply the Central Limit Theorem, taking the means of different random samples (say, 10000 times), to get a smooth bell curve?

Any help in Python is appreciated, thanks.


Solution

  • Something like:

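    import numpy as np

    # Assumes df is the DataFrame from the question, with a 'Height' column
    sampleMeans = []
    for _ in range(100000):
        # Draw 100 rows at random (without replacement) and record their mean
        samples = df['Height'].sample(n=100)
        sampleMean = np.mean(samples)
        sampleMeans.append(sampleMean)

    # sampleMeans now holds the means to plot; they should be approximately normally distributed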

    The mean of the sample means should be approximately equal to the mean of the original data, and their standard deviation should be about a factor of ten smaller than that of the original data, because sqrt(100) = 10. The general rule is that the standard deviation of the sample means is the standard deviation of the data divided by sqrt(n). If the plotted curve looks noisy, collect more sample means by raising the loop count. If it does not yet look like a bell curve, increase .sample(n=100) to a higher figure, which will also decrease the standard deviation of the resulting bell curve.
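
    The sqrt(n) rule above can be checked numerically. The following is only a sketch under stated assumptions: the real dataset is not available here, so `heights` is synthetic stand-in data, and the samples are drawn with replacement using NumPy rather than with `.sample(n=100)` (for a sample this small relative to the data, the difference should be negligible).

```python
import numpy as np

# Synthetic stand-in for df['Height'].to_numpy(); values are made up.
rng = np.random.default_rng(0)
heights = rng.normal(loc=175.0, scale=10.0, size=50_000)

n = 100            # size of each random sample
n_repeats = 5_000  # number of sample means to collect

# Draw n_repeats samples of size n (with replacement) in one vectorised step,
# then take the mean of each row to get one sample mean per sample.
draws = rng.choice(heights, size=(n_repeats, n), replace=True)
sample_means = draws.mean(axis=1)

predicted_sd = heights.std() / np.sqrt(n)  # data sd divided by sqrt(n)
observed_sd = sample_means.std()

print(f"mean of data:         {heights.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"predicted sd:         {predicted_sd:.3f}")
print(f"observed sd:          {observed_sd:.3f}")
```

    The two means should come out close to each other, and the observed standard deviation should land near the predicted one. Changing `n` here is a quick way to see how the spread of the bell curve shrinks.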

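    Note that the resulting distribution is the distribution of the sample means, which is different from the distribution of the original data. The CLT does not merely smooth out the original data, so a bell-shaped result says nothing about whether the heights themselves are normally distributed.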
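
    To see that the distribution of sample means really is a different distribution, one option is to try the same procedure on data that is clearly not normal. This is a sketch with hypothetical, deliberately skewed data (exponential), not the Height column. The `skewness` helper is written here for illustration and is not a library function.

```python
import numpy as np

def skewness(x):
    """Sample skewness: close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(1)

# Hypothetical, strongly right-skewed data; nothing like a bell curve.
data = rng.exponential(scale=2.0, size=50_000)

# Means of 5000 random samples of size n, drawn with replacement.
n = 100
sample_means = rng.choice(data, size=(5_000, n), replace=True).mean(axis=1)

print(f"skewness of raw data:     {skewness(data):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
print(f"sd of raw data:           {data.std():.3f}")
print(f"sd of sample means:       {sample_means.std():.3f}")
```

    The raw data stays skewed, while the sample means are close to symmetric and much narrower. Only the means behave like a bell curve; the original data is unchanged.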