Tags: python-3.x, dataset, normal-distribution, probability-distribution, weibull

Understand the nature of distribution for a dataset in Python?


Let's say I have a dataset (sinusoidal curve in this example):

import matplotlib.pyplot as plt
import numpy as np

T = 1
Fs = 10000
N = T*Fs
t = np.linspace(0,T,N)
x = 10 * np.sin(2*np.pi*2*t) 


plt.figure(figsize=(8,8))
plt.plot(t,x,'k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

How do I figure out the nature of the distribution (normal/Weibull/uniform/exponential, etc.) of 'x'?


Solution

  • Basically you have to run a goodness-of-fit test iteratively over the candidate distributions to see which one best fits your sample data.

    Luckily, fitter not only runs that iteration for you using SciPy (meaning you could also do it manually with SciPy, as sketched after this list) but also displays a plot and a table of the fit statistics.

    Below are some np.random example distributions and the sine function from your question, along with the respective code.

    • Note the heads-up sections at the end.
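
    If you want to do that iteration manually with SciPy, a minimal sketch could look like the following (the candidate list and the Kolmogorov-Smirnov test here are choices for illustration, not something fitter prescribes):

    import numpy as np
    from scipy import stats
    
    rng = np.random.default_rng(0)
    sample = rng.normal(0.0, 0.1, 10_000)      # any 1-D sample works here
    
    candidates = ['norm', 'expon', 'uniform', 'rayleigh']
    
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(sample)              # maximum-likelihood parameter fit
        ks_stat, p_value = stats.kstest(sample, name, args=params)
        results.append((name, ks_stat, p_value))
    
    # smaller KS statistic = better fit; the p-values are optimistic because
    # the parameters were fitted on the same data they are tested against
    for name, ks_stat, p_value in sorted(results, key=lambda r: r[1]):
        print(f"{name:10s}  KS={ks_stat:.4f}  p={p_value:.3f}")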

    Pre-sets:

    import numpy as np
    from fitter import Fitter, get_common_distributions
    
    distributions_set = get_common_distributions()
    distributions_set.extend(['arcsine', 'cosine', 'expon', 'weibull_max', 
                              'weibull_min', 'dweibull', 't', 'pareto', 
                              'exponnorm', 'lognorm'])
    

    Sine (from your example):

    # the value distribution of a sine wave is the arcsine distribution (arcsine = inverse sine)
    T = 1
    Fs = 10_000
    N = T*Fs
    t = np.linspace(0,T,N)
    np_sine_arr = 10 * np.sin(2*np.pi*2*t) 
    
    
    f_sine = Fitter(np_sine_arr, distributions = distributions_set) 
    f_sine.fit()
    f_sine.summary()
    

    (plot: fitter summary for the sine sample)
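
    If you also want the winning distribution's name and parameters rather than just the plot, fitter exposes get_best() and the fitted_param dictionary (a short sketch; the exact return format can differ slightly between fitter versions):

    # best distribution according to the default sum-of-squares error
    print(f_sine.get_best(method='sumsquare_error'))
    
    # fitted parameters of any tested distribution, e.g. the arcsine candidate
    print(f_sine.fitted_param['arcsine'])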

    Normal: note that the Student's t distribution typically ranks right next to the normal here, since t approaches the normal for large degrees of freedom.

    # normal
    mu, sigma = 0.0, 0.1 # mean and standard deviation
    np_normal_arr = np.random.normal(mu, sigma, 10_000)
    
    
    f_normal = Fitter(np_normal_arr, distributions = distributions_set)  
    f_normal.fit()
    f_normal.summary()
    

    (plot: fitter summary for the normal sample)
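
    Independent of fitter, SciPy also ships dedicated normality tests, which are a quick sanity check when the only question is "normal or not" (a minimal sketch reusing np_normal_arr from above):

    from scipy import stats
    
    # D'Agostino-Pearson test: H0 = the sample comes from a normal distribution
    stat, p_value = stats.normaltest(np_normal_arr)
    print(f"normaltest: statistic={stat:.3f}, p={p_value:.3f}")
    
    # Shapiro-Wilk is another common choice (best suited to smaller samples)
    stat, p_value = stats.shapiro(np_normal_arr[:5000])
    print(f"shapiro:    statistic={stat:.3f}, p={p_value:.3f}")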

    Rayleigh:

    # rayleigh
    meanvalue = 1
    modevalue = np.sqrt(2 / np.pi) * meanvalue # shape
    np_rayleigh_arr = np.random.rayleigh(modevalue, 10_000)
    
    
    f_rayleigh = Fitter(np_rayleigh_arr, distributions = distributions_set) 
    f_rayleigh.fit()
    f_rayleigh.summary()
    

    (plot: fitter summary for the Rayleigh sample)

    Pareto:

    # pareto
    a, m = 3., 2. # shape and mode
    np_pareto_arr = (np.random.pareto(a, 10_000) + 1) * m
    
    
    f_pareto = Fitter(np_pareto_arr, distributions = distributions_set) 
    f_pareto.fit()
    f_pareto.summary()
    

    (plot: fitter summary for the Pareto sample)

    Weibull:

    # weibull
    a = 5. # shape
    np_weibull_arr = np.random.weibull(a, 10_000)
    
    
    f_weibull = Fitter(np_weibull_arr, distributions = distributions_set) 
    f_weibull.fit()
    f_weibull.summary()
    

    (plot: fitter summary for the Weibull sample)
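
    As a cross-check for the Weibull case: np.random.weibull(a) draws from the two-parameter Weibull with shape a and scale 1, which corresponds to SciPy's weibull_min. A direct SciPy fit should therefore recover a shape close to 5 (a sketch reusing np_weibull_arr from above):

    from scipy import stats
    
    # fix the location at 0 to match np.random.weibull's parameterisation
    shape, loc, scale = stats.weibull_min.fit(np_weibull_arr, floc=0)
    print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")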

    Exponential:

    # exp
    np_exp_arr = np.random.exponential(scale=1.0, size=10_000)
    
    
    f_exp = Fitter(np_exp_arr, distributions = distributions_set) 
    f_exp.fit()
    f_exp.summary()
    

    (plot: fitter summary for the exponential sample)


    Heads-up 1) Make sure the latest fitter version is installed (currently 1.4.1).
    You may also have to install some dependencies.

    import fitter
    
    print(fitter.version)
    # 1.4.1
    

    If you get a logging error, that is likely because you have an older version installed.

    For me it was conda install -c bioconda fitter


    Heads-up 2) fitter has a lot of distributions to test, which takes a long time if you go for all of them.

    Best is to reduce the distributions to the common ones plus any you think are likely for your data (as done in the pre-sets section of the code above).

    To get a list of all available distributions:

    from fitter import get_distributions
    
    get_distributions()
    
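
    Besides shortening the candidate list, Fitter also accepts a timeout in seconds per distribution, so one slow fit does not stall the whole run (a sketch; check your installed version's docstring for the default, which is around 30 s):

    from fitter import Fitter, get_common_distributions
    
    f_quick = Fitter(np_normal_arr,
                     distributions=get_common_distributions(),  # roughly ten common distributions
                     timeout=10)                                 # skip any fit taking longer than 10 s
    f_quick.fit()
    f_quick.summary()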

    Heads-up 3) Depending on the distribution, several very similar ones can come up close together; you can see that in some of the examples above as well.

    Also, especially when a distribution is slightly altered (e.g. a shifted mean), a different distribution can often fit just as well; see e.g. the Wikipedia plot of the Gamma distribution's probability density, which can look like a lot of other distributions depending on the parameters.
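
    To see how close those competitors actually are, you can rank more than the default five distributions and inspect the raw error table (a sketch; the Nbest argument and the df_errors attribute are available in current fitter versions):

    # show the ten best candidates instead of the default five
    f_normal.summary(Nbest=10)
    
    # or sort the full error table yourself
    print(f_normal.df_errors.sort_values('sumsquare_error').head(10))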