matplotlib histogram probability-density scipy.stats

Fitting & scaling a probability density function correctly to a histogram with a logarithmic x-axis?

I am trying to fit a gilbrat PDF to a dataset (that I have in form of a list). I want to show the data in a histogram with a logarithmic x-scale and add the fitted curve. However, the curve seems too flat compared to the histogram, like in this picture: I tried to scale the pdf according to Fitting a Gaussian to a histogram with MatPlotLib and Numpy - wrong Y-scaling? , but the problem remains.

Here is a code example with randomly created data:

import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt

#create random dataset
data = st.gilbrat.rvs(scale = 10,  size = 100).tolist()

param = st.gilbrat.fit(data)
x = np.linspace(min(data),max(data),len(data))
pdf = st.gilbrat.pdf(x, param[0], param[1])

plt.figure()    
logbins = np.logspace(np.log10(np.min(data)),np.log10(np.max(data)),20)
result = plt.hist(data, bins=logbins, edgecolor="black",  alpha = 0.5, label="data")
dx = result[1][1] - result[1][0]
plt.plot(x,pdf * (len(data)*dx), label="fit")
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()

Am I missing something?

Solution

As your bins aren't equally spaced, the histogram isn't similar to a scaled version of the pdf. The bins at the right represent a much wider x-range than the ones at the left.

To predict the heights of the rectangles given the pdf, each bin region needs a different scaling factor, depending on the width of that bin.

The following code rescales each region independently, resulting in a discontinuously scaled pdf.

import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt

# create random dataset
np.random.seed(1)
data = st.gilbrat.rvs(scale=10, size=100)

param = st.gilbrat.fit(data)
x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
pdf = st.gilbrat.pdf(x, param[0], param[1])

plt.figure()
logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")
for b0, b1 in zip(bins[:-1], bins[1:]):
     dx = b1 - b0
     x_bin = np.logspace(np.log10(b0), np.log10(b1), 100)
     pdf_bin = st.gilbrat.pdf(x_bin, param[0], param[1])
     plt.plot(x_bin, pdf_bin * (len(data) * dx), color='crimson',
              label="expected bin height" if b0 == bins[0] else None)
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()

plt.tight_layout()
plt.show()

Here is another take, smoothing out the scaling of the pdf to any log-scale histogram. The dx is different at each x-position, in contrast to the histogram with linearly spaced bins.

import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt

# create random dataset
np.random.seed(1)
data = st.gilbrat.rvs(scale=10, size=100)

param = st.gilbrat.fit(data)
x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
pdf = st.gilbrat.pdf(x, param[0], param[1])

plt.figure()
logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")

dx_array = np.logspace(np.log10(bins[1] - bins[0]), np.log10(bins[-1] - bins[-2]), len(x))
plt.plot(x, pdf * len(data) * dx_array, color='crimson', label="pdf rescaled like histogram")

plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()

plt.tight_layout()
plt.show()