Search code examples
pythonnumpymatplotlibscipykernel-density

How can you create a KDE from histogram values only?


I have a set of values that I'd like to plot the gaussian kernel density estimation of, however there are two problems that I'm having:

  1. I only have the values of bars not the values themselves
  2. I am plotting onto a categorical axis

Here's the plot I've generated so far: HISTOGRAM The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.

I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.

Here's the code for the above grouped bar plot using python and matplotlib:

enterN = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N)    # the x locations for the groups
width = .5         # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1]+width)

Thanks!


Solution

  • How to plot a "KDE" starting from a histogram

    The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.

    Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:

    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats as sts
    
    n = 100000
    
    # generate some random multimodal histogram data
    samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
    h,e = np.histogram(samples, bins=100, density=True)
    x = np.linspace(e.min(), e.max())
    
    # plot the histogram
    plt.figure(figsize=(8,6))
    plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
    
    # plot the real KDE
    kde = sts.gaussian_kde(samples)
    plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
    
    # resample the histogram and find the KDE.
    resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
    rkde = sts.gaussian_kde(resamples)
    
    # plot the KDE
    plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
    plt.title('n = %d' % n)
    plt.legend()
    plt.show()
    

    Output:

    enter image description here

    The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.

    If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:

    enter image description here

    Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.

    Munge your categorial data into an appropriate form

    Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.