I am trying to fit a gilbrat PDF to a dataset (that I have in form of a list). I want to show the data in a histogram with a logarithmic x-scale and add the fitted curve. However, the curve seems too flat compared to the histogram, like in this picture: I tried to scale the pdf according to Fitting a Gaussian to a histogram with MatPlotLib and Numpy - wrong Y-scaling? , but the problem remains.
Here is a code example with randomly created data:
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
#create random dataset
data = st.gilbrat.rvs(scale = 10, size = 100).tolist()
param = st.gilbrat.fit(data)
x = np.linspace(min(data),max(data),len(data))
pdf = st.gilbrat.pdf(x, param[0], param[1])
plt.figure()
logbins = np.logspace(np.log10(np.min(data)),np.log10(np.max(data)),20)
result = plt.hist(data, bins=logbins, edgecolor="black", alpha = 0.5, label="data")
dx = result[1][1] - result[1][0]
plt.plot(x,pdf * (len(data)*dx), label="fit")
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()
Am I missing something?
As your bins aren't equally spaced, the histogram isn't similar to a scaled version of the pdf. The bins at the right represent a much wider x-range than the ones at the left.
To predict the heights of the rectangles given the pdf, each bin region needs a different scaling factor, depending on the width of that bin.
The following code rescales each region independently, resulting in a discontinuously scaled pdf.
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
# create random dataset
np.random.seed(1)
data = st.gilbrat.rvs(scale=10, size=100)
param = st.gilbrat.fit(data)
x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
pdf = st.gilbrat.pdf(x, param[0], param[1])
plt.figure()
logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")
for b0, b1 in zip(bins[:-1], bins[1:]):
dx = b1 - b0
x_bin = np.logspace(np.log10(b0), np.log10(b1), 100)
pdf_bin = st.gilbrat.pdf(x_bin, param[0], param[1])
plt.plot(x_bin, pdf_bin * (len(data) * dx), color='crimson',
label="expected bin height" if b0 == bins[0] else None)
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()
plt.tight_layout()
plt.show()
Here is another take, smoothing out the scaling of the pdf to any log-scale histogram. The dx
is different at each x-position, in contrast to the histogram with linearly spaced bins.
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
# create random dataset
np.random.seed(1)
data = st.gilbrat.rvs(scale=10, size=100)
param = st.gilbrat.fit(data)
x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
pdf = st.gilbrat.pdf(x, param[0], param[1])
plt.figure()
logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")
dx_array = np.logspace(np.log10(bins[1] - bins[0]), np.log10(bins[-1] - bins[-2]), len(x))
plt.plot(x, pdf * len(data) * dx_array, color='crimson', label="pdf rescaled like histogram")
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()
plt.tight_layout()
plt.show()