How to make a percentage of the type column in a histogram?


The pandas DataFrame has the following format:

import pandas as pd

df = pd.DataFrame([['A', 4], ['B', 12], ['B', 50], ['B', 19], ['A', 39], ['B', 12], ['A', 22], ['B', 33], ['B', 14], ['B', 43], ['A', 50], ['B', 34], ['A', 22],  ['B', 60],
              ['A', 14], ['B', 31], ['B', 40], ['B', 38], ['A', 21], ['B', 41], ['A', 23], ['B', 45], ['B', 25], ['B', 32], ['A', 10], ['B', 31], ['A', 21],  ['B', 51]])
df.columns = ['Type', 'Distance']

df.head(2):

  Type  Distance
0    A         4
1    B        12

Expectation:

1) Create a histogram with bins=10.
2) The y value of each bin is (number of A rows in the bin) / (number of A rows + number of B rows in the bin).

For example, in bin 10-20:

df1 = df[df['Distance'].between(10,20)]
df1.head(10)

There are 6 rows in the distance bin 10-20:

    Type  Distance
1      B        12
3      B        19
5      B        12
8      B        14
14     A        14
24     A        10

Hence,

  1. amount of A in bin (10,20): 2

  2. amount of B in bin (10,20): 4

  3. (amount of A in bin (10,20)) / (amount of A in bin (10,20) + amount of B in bin (10,20)) = 2 / (2+4) = 33.3%

  4. So for bin 10-20, the y value should be 33.3%.

The same steps apply to every other bin to calculate the percentage of A out of (A+B).
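
The arithmetic for the 10-20 bin can be checked with a short sketch. It rebuilds the sample frame from above and relies on Series.between, which is inclusive on both ends by default. The names in_bin, counts, and share_a are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 4], ['B', 12], ['B', 50], ['B', 19], ['A', 39], ['B', 12], ['A', 22],
     ['B', 33], ['B', 14], ['B', 43], ['A', 50], ['B', 34], ['A', 22], ['B', 60],
     ['A', 14], ['B', 31], ['B', 40], ['B', 38], ['A', 21], ['B', 41], ['A', 23],
     ['B', 45], ['B', 25], ['B', 32], ['A', 10], ['B', 31], ['A', 21], ['B', 51]],
    columns=['Type', 'Distance'],
)

# rows whose Distance lies in [10, 20]; between() includes both endpoints by default
in_bin = df[df['Distance'].between(10, 20)]

# number of rows per Type inside the bin
counts = in_bin['Type'].value_counts()

# share of A in the bin: A / (A + B)
share_a = counts['A'] / counts.sum()
print(f"{share_a:.1%}")
```

For the sample data this evaluates 2 / (2 + 4), the 33.3% worked out by hand above.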

I also need the x-axis bins to stay in steps of 10, not floats or anything else.

Any suggestions are appreciated.


Solution

  • IIUC, you should use pivot_table to reshape your data, compute the bins in pandas using cut, and then plot:

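    # count rows per (Distance, Type); the columns are A and B
    df2 = df.pivot_table(index='Distance', columns='Type', aggfunc='size', fill_value=0)
    # integer bin edges in steps of 10 (0, 10, ..., 60 for this data)
    bins = range(0, int(df2.index.max())+1, 10)
    # sum the A and B counts within each bin
    df3 = df2.groupby(pd.cut(df2.index, bins=bins)).sum()
    # share of A per bin, A / (A + B), drawn as adjacent bars
    df3['A'].div(df3.sum(1)).plot.bar(width=1)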
    

    output:

    [Figure "A/B": bar chart produced by the plot call above]

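    >>> df3
    Type          A  B
    (3.944, 9.6]  1  0
    (9.6, 15.2]   2  3
    (15.2, 20.8]  0  1
    (20.8, 26.4]  5  1
    (26.4, 32.0]  0  3
    (32.0, 37.6]  0  2
    (37.6, 43.2]  1  4
    (43.2, 48.8]  0  1
    (48.8, 54.4]  1  2
    (54.4, 60.0]  0  1

    Note that the intervals printed here are the ten equal-width bins that pd.cut(df2.index, bins=10) produces; with the range(...) edges from the code above, the bins are (0, 10], (10, 20], ..., (50, 60] instead.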
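
If each bin should include its left edge, [10, 20), so that the 10-20 bin reproduces the 2 / (2 + 4) = 33.3% from the question, one possible variant is sketched below. It is an alternative to the pivot_table approach rather than the answer's own code: it passes right=False to pd.cut and averages a boolean "is A" column per bin. The names edges, labels, binned, and share_a are illustrative, and the extra edge past the maximum is an assumption so that the value 60 still falls inside a bin.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 4], ['B', 12], ['B', 50], ['B', 19], ['A', 39], ['B', 12], ['A', 22],
     ['B', 33], ['B', 14], ['B', 43], ['A', 50], ['B', 34], ['A', 22], ['B', 60],
     ['A', 14], ['B', 31], ['B', 40], ['B', 38], ['A', 21], ['B', 41], ['A', 23],
     ['B', 45], ['B', 25], ['B', 32], ['A', 10], ['B', 31], ['A', 21], ['B', 51]],
    columns=['Type', 'Distance'],
)

# integer edges in steps of 10, with one edge past the maximum Distance
edges = list(range(0, int(df['Distance'].max()) + 11, 10))

# integer-only tick labels such as "10-20" for the x axis
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes every bin closed on the left and open on the right: [lo, hi)
binned = pd.cut(df['Distance'], bins=edges, right=False, labels=labels)

# the mean of a boolean column is the fraction of True values, i.e. A / (A + B)
share_a = df['Type'].eq('A').groupby(binned, observed=False).mean()

print(share_a.mul(100).round(1))
# share_a.mul(100).plot.bar(width=1)  # uncomment to draw the chart (needs matplotlib)
```

With the sample data the 10-20 bin holds 2 A rows and 4 B rows. A bin containing no rows at all would come out as NaN rather than 0, which may need handling before plotting.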