I made a probability DataFrame df, sorted by value:
    value      prob
0     -31  0.002597
1     -23  0.005195
2     -22  0.005195
3     -21  0.002597
4     -20  0.002597
5     -18  0.005195
6     -15  0.002597
...
39     19  0.007792
40     21  0.002597
41     22  0.005195
42     23  0.002597
43     25  0.002597
44     28  0.002597
45     29  0.005195
46     37  0.002597
(As you can see, value does not cover every integer between its first and last entries, -31 and 37.)
I plotted a probability distribution plot by simply executing:
import matplotlib.pyplot as plt
plt.plot(df['value'], df['prob'])
which returned a jagged line plot.
Now, I would like to smooth the probability curve, so I have tried two approaches. First, I tried np.polyfit:
import numpy as np
x = df['value']
y = df['prob']
n = 10
poly = np.polyfit(x, y, n)
poly_y = np.poly1d(poly)(x)
plt.plot(x, poly_y, color='red')
plt.plot(x, y, color='blue')
and the resulting graph shows that the polynomial fit does not smooth the probabilities successfully (manipulating the degree n did not solve the under-fitting problem).
Secondly, I tried scipy.interpolate:
from scipy import interpolate
xnew = np.linspace(x.min(), x.max(), 10)
bspline = interpolate.make_interp_spline(x, y)
y_smoothed = bspline(xnew)
plt.plot(xnew, y_smoothed, color='red')
plt.plot(x, y, color='blue')
and this returns a curve that runs into the same problem: it under-represents the probability at value = 0 (and does not really smooth the curve either).
Any recommendations on how to successfully smooth the probability distribution plot without significantly under- or over-representing the probabilities?
Probability distributions generated from a sample of observations are usually represented with a histogram. My understanding as to why this is standard practice is that a histogram (i.e. contiguous bars instead of a line) presents a more truthful picture of the underlying data.
In the example you give, the data is binned as integers. As I am lacking the information regarding what is being measured, let's first assume that your data is truly discrete and can only take integer values (e.g. net points scored by a soccer team at the end of each game during one year). Then in this case drawing your probability distribution with a line is somewhat deceitful as it gives the impression that the variable is continuous when in fact it isn't (a soccer team cannot end a game with net +1.5 points).
If, in fact, your data is continuous, that would mean that the data sample you provided has been binned into integer-wide bins. In this case, even though you do have truly continuous data, displaying the probability density with a continuous line can still be deceitful, for the following reason. As an example, let's say that you rounded all your measurements using .5 as the mid-point. Then you could have had, say, 10 measurements at -0.4 and 5 measurements at +0.3, and none in between. Yet your graph would give the impression that you had many measurements at exactly 0 when there were actually none at all.
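To make that hypothetical concrete, here is a quick numerical check (the counts and values are the made-up ones from the paragraph above):

```python
import numpy as np

# 10 hypothetical measurements at -0.4 and 5 at +0.3 -- none at exactly 0
measurements = np.array([-0.4] * 10 + [0.3] * 5)

# Rounding with .5 as the mid-point puts all 15 of them in the 0 bin...
rounded = np.round(measurements)
print((rounded == 0).all())        # True

# ...even though no raw measurement was exactly 0
print((measurements == 0).any())   # False
```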
Using bars instead of a line solves this issue as it makes it clearer that data points could be located anywhere in the range of values covered by the width of the bars, you would just need to state whether the bars are left or right inclusive with regards to the labels on the x-axis.
On to the issue of smoothing your curve. The most common way of doing this to my knowledge is using kernel density estimation. You can read about it here and see how it is implemented in Python here and here. More perspective on histograms versus kernel density estimates (KDE) and how to choose an optimal bandwidth can be found here and here.
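As a minimal sketch of the idea, here is scipy.stats.gaussian_kde applied to simulated raw measurements (simulated because a KDE needs the measurements themselves rather than precomputed probabilities, which is exactly the limitation of the question's DataFrame):

```python
import numpy as np
from scipy import stats

# Simulated raw measurements (an assumption -- the question only provides
# binned probabilities, not the underlying observations)
rng = np.random.default_rng(0)
sample = rng.laplace(loc=0, scale=5, size=1000)

kde = stats.gaussian_kde(sample)   # bandwidth chosen by Scott's rule
grid = np.linspace(sample.min(), sample.max(), 200)
density = kde(grid)                # smooth probability density estimate

# The estimated density is non-negative and integrates to roughly 1
print(np.trapz(density, grid))
```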
Here is an example of how to draw a KDE using pandas. I first create a random variable similar to your example and plot it the same way for comparison.
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
# Create an integer-valued random variable using numpy
rng = np.random.default_rng(123)
variable = rng.laplace(loc=0, scale=5, size=1000).round()
# Create dataframe with values and probabilities
var_range = variable.max() - variable.min()
probabilities, values = np.histogram(variable, bins=int(var_range), density=True)
# Plot probability distribution like in your example
df = pd.DataFrame(dict(value=values[:-1], prob=probabilities))
df.plot.line(x='value', y='prob')
Now regardless of the reasons I mentioned why drawing the distribution like this is not recommended, trying to plot a KDE over this plot based on computed probability densities would be a headache compared to if you plot your original variable directly. Indeed, plotting packages in Python are built to handle probability distributions of variables using the raw data measurements rather than computed probabilities. The following example illustrates this.
# Better way to plot the probability distribution of measured data,
# with a kernel density estimate
s = pd.Series(variable)
# Plot pandas histogram
s.plot.hist(bins=20, density=True, edgecolor='w', linewidth=0.5)
# Save default x-axis limits for final formatting because the pandas kde
# plot uses much wider limits which decreases readability
ax = plt.gca()
xlim = ax.get_xlim()
# Plot pandas KDE
s.plot.density(color='black', alpha=0.5) # identical to s.plot.kde(...)
# Reset hist x-axis limits and add legend
ax.set_xlim(xlim)
ax.legend(labels=['KDE'], frameon=False)
You can get the same plot with seaborn using the following code.
import seaborn as sns # v 0.11.0
sns.histplot(data=variable, bins=20, stat='density', alpha=1, kde=True,
             edgecolor='white', linewidth=0.5,
             line_kws=dict(color='black', alpha=0.5,
                           linewidth=1.5, label='KDE'))
plt.gca().get_lines()[0].set_color('black')  # manually edit line color due to bug in sns v 0.11.0
plt.legend(frameon=False)
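Regarding the bandwidth choice mentioned earlier: both s.plot.density and the underlying scipy.stats.gaussian_kde accept a bw_method argument that scales the kernel width. A quick sketch of its effect (the bandwidth values below are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
variable = rng.laplace(loc=0, scale=5, size=1000).round()
grid = np.linspace(variable.min(), variable.max(), 200)

# A smaller bw_method gives narrower kernels (a spikier, less smoothed curve);
# a larger one gives wider kernels (a flatter, more smoothed curve)
for bw in (0.1, 0.5, 1.0):
    kde = stats.gaussian_kde(variable, bw_method=bw)
    print(bw, kde(grid).max())     # peak density shrinks as the bandwidth grows
```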