Tags: python, matplotlib, pandas, histogram, jupyter-notebook

Plotting a histogram in Pandas with very heavy-tailed data


I often work with data that has a very long tail. I want to plot histograms to summarize the distribution, but when I try to do this with pandas I end up with a bar graph that has one giant visible bar and everything else invisible.

Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.

In [10]: data.value_counts().sort_index()

Out[10]:
0          8012
25         3710
100       10794
200       11718
300        2489
500        7631
600          34
700         115
1000       3099
1200       1766
1600         63
2000       1538
2200         41
2500        208
2700       2138
5000        515
5500        201
8800         10
10000        10
10900       465
13000         9
16200        74
20000       518
21500        65
27000        64
53000        82
56000         1
106000       35
530000        3

I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 530000 into one group of >50000, etc.), and also changing the y axis to show the percentage of occurrences rather than the absolute number. However, I don't understand how I would go about doing that automatically.


Solution

  • import pandas as pd
    from matplotlib import pyplot as plt
    import numpy as np
    
    
    mydict = {0: 8012,25: 3710,100: 10794,200: 11718,300: 2489,500: 7631,600: 34,700: 115,1000: 3099,1200: 1766,1600: 63,2000: 1538,2200: 41,2500: 208,2700: 2138,5000: 515,5500: 201,8800: 10,10000: 10,10900: 465,13000: 9,16200: 74,20000: 518,21500: 65,27000: 64,53000: 82,56000: 1,106000: 35,530000: 3}
    mylist = []
    
    # Expand the value -> count mapping into one list entry per observation.
    for key, count in mydict.items():
        mylist.extend([key] * count)
    
    df = pd.DataFrame(mylist,columns=['value'])
    df2 = df[df.value <= 5000]
    

    Plot as a bar:

    ax = df.value.value_counts().sort_index().plot(kind="bar")
    plt.savefig("figure.png")
    

    [Figure: bar chart of the value counts]

    As a histogram, limited to values of 5000 and under, which covers more than 97% of your data. I like using np.linspace to control the bin edges:

    df2 = df[df.value <= 5000]
    df2.hist(bins=np.linspace(0,5000,101))
    plt.savefig('hist1')
    

    [Figure: histogram of values up to 5000]
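
To sanity-check the numbers used for the histogram, here is a minimal sketch. It rebuilds the data from the counts in the question, checks how many bins the np.linspace edges produce, and computes the share of observations at or below 5000. The names counts, values, edges and share_shown are my own, and np.repeat is used as a shortcut for expanding the counts; treat it as an illustration rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Counts copied from the question (value -> number of occurrences).
counts = pd.Series({
    0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
    600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
    2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
    10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
    27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3,
})

# Expand the counts back into one entry per observation.
values = pd.Series(np.repeat(counts.index.to_numpy(), counts.to_numpy()),
                   name="value")

# 101 evenly spaced edges from 0 to 5000 define 100 bins.
edges = np.linspace(0, 5000, 101)
bin_width = edges[1] - edges[0]

# With only 100 edges there are 99 bins, so the width is 5000 / 99.
uneven_width = np.linspace(0, 5000, 100)[1]

# Share of observations that the 0-5000 histogram actually shows.
share_shown = (values <= 5000).mean()

print(len(edges) - 1, bin_width)
print(round(uneven_width, 2))
print(f"{share_shown:.1%}")
```

The number of bins is always one less than the number of edges, which is why 101 edges are needed for 100 equal bins of width 50.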

    EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101), so that the 101 edges give 100 bins of width 50, and updated the histogram.
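
The question also asks about lumping the rare large values into one group and showing percentages instead of absolute counts. One possible way (certainly not the only one) is pd.cut with hand-picked edges and an open-ended last bin, followed by value_counts(normalize=True). The edges and labels below are assumptions I chose for illustration, so adjust them to your data. The matplotlib.use("Agg") line is only there so the sketch runs as a plain script; drop it in a notebook.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless script; remove this line in Jupyter
from matplotlib import pyplot as plt

# Counts copied from the question (value -> number of occurrences).
counts = pd.Series({
    0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
    600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
    2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
    10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
    27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3,
})
values = pd.Series(np.repeat(counts.index.to_numpy(), counts.to_numpy()),
                   name="value")

# Hypothetical, hand-picked bin edges; the last bin is open-ended so every
# value of 50000 or more lands in a single group.
edges = [0, 100, 500, 1000, 5000, 50000, np.inf]
labels = ["0-99", "100-499", "500-999", "1000-4999", "5000-49999", ">=50000"]

# right=False makes each bin include its left edge and exclude its right edge.
binned = pd.cut(values, bins=edges, labels=labels, right=False)

# normalize=True returns fractions; multiply by 100 for percentages.
# sort_index() puts the bins back in their natural order.
percent = binned.value_counts(normalize=True).sort_index() * 100

ax = percent.plot(kind="bar")
ax.set_ylabel("Share of observations (%)")
plt.tight_layout()
```

The bars are no longer proportional to bin width, so this is a bar chart of grouped shares rather than a true histogram, but every group stays visible. Call plt.savefig(...) afterwards if you want the image on disk.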