
How to smooth a probability distribution plot in Python?


I made a probability DataFrame df, sorted by value:

    value   prob
0   -31     0.002597
1   -23     0.005195
2   -22     0.005195
3   -21     0.002597
4   -20     0.002597
5   -18     0.005195
6   -15     0.002597
...
39  19      0.007792
40  21      0.002597
41  22      0.005195
42  23      0.002597
43  25      0.002597
44  28      0.002597
45  29      0.005195
46  37      0.002597

(As you can see, the entries in value do not cover every integer between the values in rows 0 and 46.)

I plotted a probability distribution plot by simply executing:

import matplotlib.pyplot as plt

plt.plot(df['value'], df['prob'])

which produced the following plot:

[image]

Now, I would like to smooth the probability curve, so I have tried two approaches. First, I tried np.polyfit:

import numpy as np

x = df['value']
y = df['prob']
n = 10

poly = np.polyfit(x,y,n)
poly_y = np.poly1d(poly)(x)
plt.plot(x,poly_y, color='red')
plt.plot(x,y, color='blue')

and the resulting graph is

[image]

which does not follow the probabilities well: the peak is under-represented, and changing the value of n did not solve this.

Secondly, I tried scipy.interpolate:

from scipy import interpolate

xnew = np.linspace(x.min(), x.max(), 10) 
bspline = interpolate.make_interp_spline(x, y)
y_smoothed = bspline(xnew)
plt.plot(xnew, y_smoothed, color='red')
plt.plot(x,y, color='blue')

and this returns

[image]

which has the same problem of under-representing the probability at value = 0 (and does not really smooth the curve either).

Any recommendations on how to smooth the probability distribution plot without significantly under- or over-representing the probabilities?


Solution

  • Probability distributions estimated from a sample of observations are usually shown as a histogram. As I understand it, this is standard practice because a histogram (contiguous bars instead of a line) gives a more truthful picture of the underlying data.

    In the example you give, the data is binned as integers. Since I don't know what is being measured, let's first assume that your data is truly discrete and can only take integer values (e.g. net points scored by a soccer team at the end of each game during one year). In that case, drawing the probability distribution as a line is somewhat misleading: it gives the impression that the variable is continuous when it isn't (a soccer team cannot end a game with net +1.5 points).

    If your data is in fact continuous, then the sample you provided has been binned into integer-wide bins. Even then, displaying the probability density as a continuous line can still be misleading. As an example, say you rounded all your measurements using .5 as the mid-point. You could then have had 10 measurements at -0.4 and 5 measurements at +0.3, and none in between. Yet your graph gives the impression that you had many measurements at exactly 0 when there were actually none at all.

    Using bars instead of a line solves this issue, because bars make it clear that data points could be located anywhere in the range of values covered by the width of each bar. You would just need to state whether the bars are left- or right-inclusive with regard to the labels on the x-axis.
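
    As a minimal sketch of the bar version, assuming matplotlib and only the first few rows copied from the table in the question (the rows elided there are left out, so this is not the full distribution), each probability can be drawn as a bar one integer wide:

```python
import pandas as pd
import matplotlib.pyplot as plt

# First rows of the question's table; the elided rows are deliberately omitted
df = pd.DataFrame({
    "value": [-31, -23, -22, -21, -20, -18, -15],
    "prob": [0.002597, 0.005195, 0.005195, 0.002597,
             0.002597, 0.005195, 0.002597],
})

fig, ax = plt.subplots()
# width=1 makes each bar span one integer; align="center" centres it on the value
bars = ax.bar(df["value"], df["prob"], width=1, align="center",
              edgecolor="w", linewidth=0.5)
ax.set_xlabel("value")
ax.set_ylabel("probability")
```

    Integers that never occur in the data simply show up as gaps between the bars, instead of being bridged by a line.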

    On to the issue of smoothing your curve. The most common way of doing this to my knowledge is using kernel density estimation. You can read about it here and see how it is implemented in Python here and here. More perspective on histograms versus kernel density estimates (KDE) and how to choose an optimal bandwidth can be found here and here.
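
    To make the role of the bandwidth concrete, here is a small sketch that uses scipy.stats.gaussian_kde on synthetic data (a rounded Laplace sample, which is an assumption and not the data from the question). The three bw_method values are arbitrary illustrations. A scalar bw_method is used as a factor that scales the sample standard deviation, so larger values smooth more and flatten the peak:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the raw measurements (assumption: not the asker's data)
rng = np.random.default_rng(123)
sample = rng.laplace(loc=0, scale=5, size=1000).round()

# Estimated density at the mode (value = 0) for increasing bandwidth factors
peaks = {}
for factor in (0.2, 0.5, 1.0):  # illustrative values only
    kde = gaussian_kde(sample, bw_method=factor)
    peaks[factor] = float(kde(0.0)[0])

for factor, peak in peaks.items():
    print(f"bw_method={factor}: density at 0 = {peak:.4f}")
```

    With a KDE, an overly wide bandwidth is what would under-represent the peak near value = 0, which is the effect the question wants to avoid.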

    Here is an example of how to draw a KDE using pandas. I first create a random variable similar to your example and plot it the same way for comparison.

    import numpy as np                 # v 1.19.2
    import pandas as pd                # v 1.1.3
    import matplotlib.pyplot as plt    # v 3.3.2
    
    # Create an integer-valued random variable using numpy
    rng = np.random.default_rng(123)
    variable = rng.laplace(loc=0, scale=5, size=1000).round()
    
    # Create dataframe with values and probabilities
    var_range = variable.max()-variable.min()
    probabilities, values = np.histogram(variable, bins=int(var_range), density=True)
    
    # Plot probability distribution like in your example
    df = pd.DataFrame(dict(value=values[:-1], prob=probabilities))
    df.plot.line(x='value', y='prob')
    

    [image: density_line]

    Setting aside the reasons above for not drawing the distribution this way, plotting a KDE over this plot from the computed probability densities would be a headache compared with plotting the original variable directly. Plotting packages in Python are built to work with the raw measurements of a variable rather than with precomputed probabilities. The following example illustrates this.

    # Better way to plot the probability distribution of measured data,
    # with a kernel density estimate
    
    s = pd.Series(variable)
    
    # Plot pandas histogram
    s.plot.hist(bins=20, density=True, edgecolor='w', linewidth=0.5)
    
    # Save default x-axis limits for final formatting because the pandas kde
    # plot uses much wider limits which decreases readability
    ax = plt.gca()
    xlim = ax.get_xlim()
    
    # Plot pandas KDE
    s.plot.density(color='black', alpha=0.5) # identical to s.plot.kde(...)
    
    # Reset hist x-axis limits and add legend
    ax.set_xlim(xlim)
    ax.legend(labels=['KDE'], frameon=False)
    

    [image: hist_kde_pandas]

    You can get the same plot with seaborn using the following code.

    import seaborn as sns    # v 0.11.0
    sns.histplot(data=variable, bins=20, stat='density', alpha= 1, kde=True,
                 edgecolor='white', linewidth=0.5,
                 line_kws=dict(color='black', alpha=0.5,
                               linewidth=1.5, label='KDE'))
    plt.gca().get_lines()[0].set_color('black') # manually edit line color due to bug in sns v 0.11.0
    plt.legend(frameon=False)
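
    If only the table of values and probabilities is available, as in the question, one possible workaround (an addition here, not part of the pandas or seaborn examples) is a weighted KDE. The sketch below assumes a version of scipy in which gaussian_kde accepts a weights argument, and it builds a stand-in table from a synthetic rounded Laplace sample rather than from the asker's data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the question's table: distinct integer values and their
# probabilities (assumption: synthetic data, not the asker's DataFrame)
rng = np.random.default_rng(123)
variable = rng.laplace(loc=0, scale=5, size=1000).round()
values, counts = np.unique(variable, return_counts=True)
probs = counts / counts.sum()

# Each distinct value contributes one kernel, weighted by its probability
kde = gaussian_kde(values, weights=probs)

# Evaluate the smoothed density on a fine grid that extends past the data
grid = np.linspace(values.min() - 10, values.max() + 10, 500)
density = kde(grid)
```

    The resulting curve can be drawn with plt.plot(grid, density). Note that with weights, scipy derives its default bandwidth from the effective number of weighted points rather than from the original sample size, so the curve may come out smoother than a KDE of the raw data; bw_method can be adjusted to compensate.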
    

    Documentation: pandas, seaborn