python pandas scipy histogram kurtosis

How can I calculate the kurtosis of already binned data?


Does anyone know how to calculate the kurtosis of a distribution from binned data alone using Python?

I have a histogram of a distribution, but not the raw data. There are two columns: one with the bin value and one with the count for that bin. I need to calculate the kurtosis of the distribution.

If I had the raw data, I could use scipy.stats.kurtosis, but I can't see anything in its documentation about working from binned data: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html

scipy.stats.binned_statistic can compute a statistic such as kurtosis within each bin, but it needs the raw data and only reports per-bin values, not the kurtosis of the whole distribution: https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.binned_statistic.html

Edit: Example data. I could try and resample from this to create my own dummy raw data, but I have about 140k of these to run each day and was hoping for something built-in.

Index,Bin,Count
 0, 730, 30
 1, 735, 45
 2, 740, 41
 3, 745, 62
 4, 750, 80
 5, 755, 96
 6, 760, 94
 7, 765, 90
 8, 770, 103
 9, 775, 96
10, 780, 95
11, 785, 109
12, 790, 102
13, 795, 99
14, 800, 93
15, 805, 101
16, 810, 109
17, 815, 98
18, 820, 89
19, 825, 62
20, 830, 71
21, 835, 69
22, 840, 58
23, 845, 50
24, 850, 42

Solution

  • You can calculate the statistics directly. If x holds the bin values and y the counts for each bin (both as NumPy arrays), then the expected value of f(x) is np.sum(y*f(x))/np.sum(y). Applying this to the definition of kurtosis gives the following code:

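    import numpy as np

    # x: NumPy array of bin values, y: NumPy array of counts per bin
    total = np.sum(y)                             # total number of observations
    mean = np.sum(y * x) / total                  # count-weighted mean
    variance = np.sum(y * (x - mean)**2) / total  # population (biased) variance
    kurtosis = np.sum(y * (x - mean)**4) / (variance**2 * total)  # m4 / m2**2
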

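    Note that kurtosis and excess kurtosis are not the same thing. The code above gives the plain (Pearson) kurtosis, which is 3 for a normal distribution. Subtract 3 to get the excess kurtosis, which is what scipy.stats.kurtosis returns by default (fisher=True).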
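
As a rough sketch of how this could be applied to the example data from the question, the snippet below wraps the formula in a function and cross-checks it against the "expand the histogram back into raw data" approach using np.repeat. The function name binned_kurtosis and its excess flag are made up for illustration, and the code assumes every observation sits exactly at its bin value.

```python
import numpy as np

def binned_kurtosis(bins, counts, excess=False):
    """Kurtosis of a histogram (hypothetical helper, not part of SciPy).

    Assumes every observation in a bin sits exactly at that bin's value.
    """
    x = np.asarray(bins, dtype=float)
    w = np.asarray(counts, dtype=float)
    total = w.sum()
    mean = np.sum(w * x) / total
    dev = x - mean
    m2 = np.sum(w * dev**2) / total   # second central moment (population variance)
    m4 = np.sum(w * dev**4) / total   # fourth central moment
    kurt = m4 / m2**2
    return kurt - 3.0 if excess else kurt

# Example data from the question: bin values 730, 735, ..., 850
bins = np.arange(730, 855, 5)
counts = np.array([30, 45, 41, 62, 80, 96, 94, 90, 103, 96, 95, 109, 102,
                   99, 93, 101, 109, 98, 89, 62, 71, 69, 58, 50, 42])

k = binned_kurtosis(bins, counts)

# Cross-check: rebuild pseudo raw data by repeating each bin value `count` times
expanded = np.repeat(bins, counts).astype(float)
d = expanded - expanded.mean()
k_expanded = np.mean(d**4) / np.mean(d**2)**2

print(k, k_expanded)  # the two values should agree up to floating-point error
```

The weighted version never materialises the expanded array, so it should scale better when many histograms have to be processed. If SciPy is available, scipy.stats.kurtosis(expanded, fisher=False) should give the same number as k, since its default bias=True uses the same population moments.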