Tags: python-3.x, dataset, normal-distribution, probability-distribution, weibull

Understand the nature of distribution for a dataset in Python?


Let's say I have a dataset (sinusoidal curve in this example):

import matplotlib.pyplot as plt
import numpy as np

T = 1
Fs = 10000
N = T*Fs
t = np.linspace(0,T,N)
x = 10 * np.sin(2*np.pi*2*t) 


plt.figure(figsize=(8,8))
plt.plot(t,x,'k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

How do I figure out the nature of the distribution (normal/Weibull/uniform/exponential, etc.) of 'x'?


Solution

  • Basically you have to run a goodness-of-fit test iteratively over the candidate distributions to see which one best fits your sample data.

    Luckily, fitter not only runs that iteration for you using SciPy (meaning you could also do it manually with SciPy, as sketched after this list) but also displays a plot and a table of the fit statistics.

    Below are some np.random example distributions and the sine function from your question, along with the respective code.

    • Note the heads-up sections at the end.
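
    If you want to do that iteration manually with SciPy, a minimal sketch could look like the following (the candidate list and the Kolmogorov-Smirnov test here are choices for illustration, not something fitter prescribes):

    import numpy as np
    from scipy import stats
    
    rng = np.random.default_rng(0)
    sample = rng.normal(0.0, 0.1, 10_000)      # any 1-D sample works here
    
    candidates = ['norm', 'expon', 'uniform', 'rayleigh']
    
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(sample)              # maximum-likelihood parameter fit
        ks_stat, p_value = stats.kstest(sample, name, args=params)
        results.append((name, ks_stat, p_value))
    
    # smaller KS statistic = better fit; the p-values are optimistic because
    # the parameters were fitted on the same data they are tested against
    for name, ks_stat, p_value in sorted(results, key=lambda r: r[1]):
        print(f"{name:10s}  KS={ks_stat:.4f}  p={p_value:.3f}")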

    Pre-sets:

    import numpy as np
    from fitter import Fitter, get_common_distributions
    
    distributions_set = get_common_distributions()
    distributions_set.extend(['arcsine', 'cosine', 'expon', 'weibull_max', 
                              'weibull_min', 'dweibull', 't', 'pareto', 
                              'exponnorm', 'lognorm'])
    

    Sine (from your example):

    # the value distribution of a sine wave is the arcsine distribution (arcsine = inverse sine)
    T = 1
    Fs = 10_000
    N = T*Fs
    t = np.linspace(0,T,N)
    np_sine_arr = 10 * np.sin(2*np.pi*2*t) 
    
    
    f_sine = Fitter(np_sine_arr, distributions = distributions_set) 
    f_sine.fit()
    f_sine.summary()
    

    (plot: fitter summary for the sine sample)
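
    If you also want the winning distribution's name and parameters rather than just the plot, fitter exposes get_best() and the fitted_param dictionary (a short sketch; the exact return format can differ slightly between fitter versions):

    # best distribution according to the default sum-of-squares error
    print(f_sine.get_best(method='sumsquare_error'))
    
    # fitted parameters of any tested distribution, e.g. the arcsine candidate
    print(f_sine.fitted_param['arcsine'])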

    Normal: note that the Student's t distribution typically ranks right next to the normal here, since t approaches the normal for large degrees of freedom.

    # normal
    mu, sigma = 0.0, 0.1 # mean and standard deviation
    np_normal_arr = np.random.normal(mu, sigma, 10_000)
    
    
    f_normal = Fitter(np_normal_arr, distributions = distributions_set)  
    f_normal.fit()
    f_normal.summary()
    

    (plot: fitter summary for the normal sample)
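
    Independent of fitter, SciPy also ships dedicated normality tests, which are a quick sanity check when the only question is "normal or not" (a minimal sketch reusing np_normal_arr from above):

    from scipy import stats
    
    # D'Agostino-Pearson test: H0 = the sample comes from a normal distribution
    stat, p_value = stats.normaltest(np_normal_arr)
    print(f"normaltest: statistic={stat:.3f}, p={p_value:.3f}")
    
    # Shapiro-Wilk is another common choice (best suited to smaller samples)
    stat, p_value = stats.shapiro(np_normal_arr[:5000])
    print(f"shapiro:    statistic={stat:.3f}, p={p_value:.3f}")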

    Rayleigh:

    # rayleigh
    meanvalue = 1
    modevalue = np.sqrt(2 / np.pi) * meanvalue # shape
    np_rayleigh_arr = np.random.rayleigh(modevalue, 10_000)
    
    
    f_rayleigh = Fitter(np_rayleigh_arr, distributions = distributions_set) 
    f_rayleigh.fit()
    f_rayleigh.summary()
    

    (plot: fitter summary for the Rayleigh sample)

    Pareto:

    # pareto
    a, m = 3., 2. # shape and mode
    np_pareto_arr = (np.random.pareto(a, 10_000) + 1) * m
    
    
    f_pareto = Fitter(np_pareto_arr, distributions = distributions_set) 
    f_pareto.fit()
    f_pareto.summary()
    

    (plot: fitter summary for the Pareto sample)

    Weibull:

    # weibull
    a = 5. # shape
    np_weibull_arr = np.random.weibull(a, 10_000)
    
    
    f_weibull = Fitter(np_weibull_arr, distributions = distributions_set) 
    f_weibull.fit()
    f_weibull.summary()
    

    (plot: fitter summary for the Weibull sample)
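
    As a cross-check for the Weibull case: np.random.weibull(a) draws from the two-parameter Weibull with shape a and scale 1, which corresponds to SciPy's weibull_min. A direct SciPy fit should therefore recover a shape close to 5 (a sketch reusing np_weibull_arr from above):

    from scipy import stats
    
    # fix the location at 0 to match np.random.weibull's parameterisation
    shape, loc, scale = stats.weibull_min.fit(np_weibull_arr, floc=0)
    print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")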

    Exponential:

    # exp
    np_exp_arr = np.random.exponential(scale=1.0, size=10_000)
    
    
    f_exp = Fitter(np_exp_arr, distributions = distributions_set) 
    f_exp.fit()
    f_exp.summary()
    

    (plot: fitter summary for the exponential sample)


    Heads-up 1) Make sure the latest fitter version is installed (currently 1.4.1).
    You may also have to install some dependencies.

    import fitter
    
    print(fitter.version)
    # 1.4.1
    

    If you get a logging error, that is likely because you have an older version installed.

    For me it was conda install -c bioconda fitter


    Heads-up 2) fitter has a lot of distributions to test, which takes a long time if you go for all of them.

    Best is to reduce the distributions to the common ones plus any you think are likely for your data (as done in the pre-sets section of the code above).

    To get a list of all available distributions:

    from fitter import get_distributions
    
    get_distributions()
    
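
    Besides shortening the candidate list, Fitter also accepts a timeout in seconds per distribution, so one slow fit does not stall the whole run (a sketch; check your installed version's docstring for the default, which is around 30 s):

    from fitter import Fitter, get_common_distributions
    
    f_quick = Fitter(np_normal_arr,
                     distributions=get_common_distributions(),  # roughly ten common distributions
                     timeout=10)                                 # skip any fit taking longer than 10 s
    f_quick.fit()
    f_quick.summary()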

    Heads-up 3) Depending on the distribution, several very similar ones can come up close together; you can see that in some of the examples above as well.

    Also, especially when a distribution is slightly altered (e.g. a shifted mean), a different distribution can often fit just as well; see e.g. the Wikipedia plot of the Gamma distribution's probability density, which can look like a lot of other distributions depending on the parameters.
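
    To see how close those competitors actually are, you can rank more than the default five distributions and inspect the raw error table (a sketch; the Nbest argument and the df_errors attribute are available in current fitter versions):

    # show the ten best candidates instead of the default five
    f_normal.summary(Nbest=10)
    
    # or sort the full error table yourself
    print(f_normal.df_errors.sort_values('sumsquare_error').head(10))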