Tags: python, seaborn, histogram, normalization, kernel-density

Seaborn probability histplot - KDE normalization


When plotting a histplot with stat="density" (the default is "count") and kde=True, the area under the KDE curve is equal to 1. From the Seaborn documentation:

"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."

Below is an example of a density histplot with the default KDE, normalized to 1.

[image: density histplot with KDE]
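
The normalization described in the documentation quote can be checked numerically. The sketch below is not seaborn's own code: it assumes scipy's gaussian_kde as a stand-in estimator and uses a small made-up sample. It integrates the estimate over the whole real line, integrates it over a finite range to get a probability, and reads off the curve height at one point, which is a density rather than a probability.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Small made-up sample, chosen only for illustration.
data = [115, 127, 128, 145, 160]
kde = gaussian_kde(data)

# Integral over all possible values: approximately 1.
total = kde.integrate_box_1d(-np.inf, np.inf)

# Probability that the estimate assigns to the range [120, 150].
p_range = kde.integrate_box_1d(120, 150)

# Height of the curve at a single point: a density (units of 1/x).
height = kde(130)[0]

print(f"total area         = {total:.6f}")
print(f"P(120 <= x <= 150) = {p_range:.4f}")
print(f"density at x = 130 = {height:.5f}")
```

The height at a point depends on the scale of the data: measuring the same sample in different units changes the heights, while the total area stays at 1.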

However, you can also plot a histogram with stat set to "count" or "probability". Plotting a KDE on top of those produces the following:

[images: count and probability histplots with KDE]

How is the KDE normalized in these cases? The area is certainly not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation concerns the KDE plotted over a density histogram. Any help appreciated here, thank you!


Solution

  • The region below the kde curve always has an area of 1. To draw a kde that matches the histogram, the kde needs to be multiplied by the area of the histogram.

    For a density plot, the histogram has an area of 1, so the kde can be used as-is.

    For a count plot, the sum of the histogram heights will be the length of the given data (each data item belongs to exactly one bar). The area of the histogram will be that total height multiplied by the width of the bins. (If the bins didn't have equal widths, adjusting the kde would be quite tricky.)

    For a probability plot, the sum of the histogram heights will be 1 (for 100 %). The total area will be the bin_width multiplied by the sum of the heights, so it is equal to the bin_width.

    Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:

    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde
    import numpy as np

    data = [115, 127, 128, 145, 160]
    bin_values, bin_edges = np.histogram(data, bins=4)
    bin_width = bin_edges[1] - bin_edges[0]  # all bins have the same width
    total_area = bin_width * len(data)  # area of the count histogram

    kde = gaussian_kde(data)  # the estimated density integrates to 1
    x = np.linspace(bin_edges[0], bin_edges[-1], 200)

    fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
    kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}
    # density: the histogram area is 1, so the kde is used as-is
    axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
    axs[0].plot(x, kde(x), color='dodgerblue')
    axs[0].set_ylabel('density')

    # probability: the histogram area is bin_width, so scale the kde by bin_width
    axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
    axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
    axs[1].set_ylabel('probability')

    # count: the histogram area is bin_width * len(data), so scale the kde by that
    axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
    axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
    axs[2].set_ylabel('count')

    plt.tight_layout()
    plt.show()
    

    [image: calculating seaborn histograms with kde]
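
The three scaling factors can also be checked numerically, without plotting. The following is a minimal sketch under the same assumptions as the code above (equal-width bins, scipy's gaussian_kde standing in for the estimator); the dictionary names hist_area and kde_scale are introduced here for illustration only. It compares the area of each histogram variant with the area of the correspondingly scaled kde.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = [115, 127, 128, 145, 160]
counts, bin_edges = np.histogram(data, bins=4)
bin_width = bin_edges[1] - bin_edges[0]  # equal-width bins are assumed

# Area of each histogram variant: sum of the bar heights times the bin width.
hist_area = {
    "density": np.sum(counts / (len(data) * bin_width)) * bin_width,
    "probability": np.sum(counts / len(data)) * bin_width,
    "count": np.sum(counts) * bin_width,
}

# Factor applied to the unit-area kde for each variant.
kde_scale = {
    "density": 1.0,
    "probability": bin_width,
    "count": bin_width * len(data),
}

kde = gaussian_kde(data)
kde_total = kde.integrate_box_1d(-np.inf, np.inf)  # approximately 1

for stat in hist_area:
    kde_area = kde_scale[stat] * kde_total
    print(f"{stat:12s} histogram area = {hist_area[stat]:8.4f}   "
          f"scaled kde area = {kde_area:8.4f}")
```

The areas agree when the kde is integrated over the whole real line. A curve drawn only between the first and last bin edges, as in the plots above, covers slightly less area because the tails of the kde extend beyond the bins.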