Search code examples
matplotlibhistogramprobability-densityscipy.stats

Fitting & scaling a probability density function correctly to a histogram with a logarithmic x-axis?


I am trying to fit a gilbrat PDF to a dataset (that I have in form of a list). I want to show the data in a histogram with a logarithmic x-scale and add the fitted curve. However, the curve seems too flat compared to the histogram, like in this picture: enter image description here I tried to scale the pdf according to Fitting a Gaussian to a histogram with MatPlotLib and Numpy - wrong Y-scaling? , but the problem remains.

Here is a code example with randomly created data:

import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt

#create random dataset
data = st.gilbrat.rvs(scale = 10,  size = 100).tolist()

param = st.gilbrat.fit(data)
x = np.linspace(min(data),max(data),len(data))
pdf = st.gilbrat.pdf(x, param[0], param[1])

plt.figure()    
logbins = np.logspace(np.log10(np.min(data)),np.log10(np.max(data)),20)
result = plt.hist(data, bins=logbins, edgecolor="black",  alpha = 0.5, label="data")
dx = result[1][1] - result[1][0]
plt.plot(x,pdf * (len(data)*dx), label="fit")
plt.xscale('log')
plt.xlabel("x")
plt.ylabel("Number of occurence")
plt.legend()

Am I missing something?


Solution

  • As your bins aren't equally spaced, the histogram isn't similar to a scaled version of the pdf. The bins at the right represent a much wider x-range than the ones at the left.

    To predict the heights of the rectangles given the pdf, each bin region needs a different scaling factor, depending on the width of that bin.

    The following code rescales each region independently, resulting in a discontinuously scaled pdf.

    import scipy.stats as st
    import numpy as np
    import matplotlib.pyplot as plt
    
    # create random dataset
    np.random.seed(1)
    data = st.gilbrat.rvs(scale=10, size=100)
    
    param = st.gilbrat.fit(data)
    x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
    pdf = st.gilbrat.pdf(x, param[0], param[1])
    
    plt.figure()
    logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
    heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")
    for b0, b1 in zip(bins[:-1], bins[1:]):
         dx = b1 - b0
         x_bin = np.logspace(np.log10(b0), np.log10(b1), 100)
         pdf_bin = st.gilbrat.pdf(x_bin, param[0], param[1])
         plt.plot(x_bin, pdf_bin * (len(data) * dx), color='crimson',
                  label="expected bin height" if b0 == bins[0] else None)
    plt.xscale('log')
    plt.xlabel("x")
    plt.ylabel("Number of occurence")
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    

    non uniform scaling of pdf to log-scale bins

    Here is another take, smoothing out the scaling of the pdf to any log-scale histogram. The dx is different at each x-position, in contrast to the histogram with linearly spaced bins.

    import scipy.stats as st
    import numpy as np
    import matplotlib.pyplot as plt
    
    # create random dataset
    np.random.seed(1)
    data = st.gilbrat.rvs(scale=10, size=100)
    
    param = st.gilbrat.fit(data)
    x = np.logspace(np.log10(data.min()), np.log10(data.max()), 500)
    pdf = st.gilbrat.pdf(x, param[0], param[1])
    
    plt.figure()
    logbins = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
    heights, bins, rectangles = plt.hist(data, bins=logbins, edgecolor="black", alpha=0.5, label="data")
    
    dx_array = np.logspace(np.log10(bins[1] - bins[0]), np.log10(bins[-1] - bins[-2]), len(x))
    plt.plot(x, pdf * len(data) * dx_array, color='crimson', label="pdf rescaled like histogram")
    
    plt.xscale('log')
    plt.xlabel("x")
    plt.ylabel("Number of occurence")
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    

    smoothed scaling of the pdf