Tags: python, pandas, matplotlib, stacked-chart, plot-annotations

How to create and annotate a stacked proportional bar chart


I'm struggling to create a stacked bar chart derived from the value_counts() of the columns of a dataframe.

Assume a dataframe like the following, where responder is not important, but I would like to stack the counts of [1,2,3,4,5] for each of the q# columns.

responder, q1, q2, q3, q4, q5
------------------------------
r1, 5, 3, 2, 4, 1
r2, 3, 5, 1, 4, 2
r3, 2, 1, 3, 4, 5
r4, 1, 4, 5, 3, 2
r5, 1, 2, 5, 3, 4
r6, 2, 3, 4, 5, 1
r7, 4, 3, 2, 1, 5

The chart should look something like the image below, except that each bar would be labeled by q# and would include 5 sections for the counts of [1,2,3,4,5] from the data:

[image: example stacked bar chart]

Ideally, all bars will be "100%" wide, showing each count as a proportion of the bar. But it's guaranteed that each responder row will have one entry for each, so the percentage is just a bonus if possible.

Any help would be much appreciated, with a slight preference for a matplotlib solution.


Solution

  • You can calculate the width of each bar segment as a proportion and obtain the stacked bar plot using ax = percents.T.plot(kind='barh', stacked=True), where percents is a DataFrame with q1,...,q5 as columns and 1,...,5 as the index.

    >>> percents
           q1      q2      q3      q4      q5
    1  0.2015  0.2040  0.2115  0.1995  0.2070
    2  0.2070  0.1905  0.2070  0.2070  0.1965
    3  0.1965  0.2115  0.1795  0.1905  0.1935
    4  0.1970  0.1970  0.1885  0.1960  0.2090
    5  0.1980  0.1970  0.2135  0.2070  0.1940
    

    Each column of percents sums to 1, so every bar is exactly "100%" wide. Then you can use ax.patches to add labels for every bar segment. Labels can be generated from the original counts DataFrame: counts = df.apply(lambda x: x.value_counts())

    >>> counts
        q1   q2   q3   q4   q5
    1  403  408  423  399  414
    2  414  381  414  414  393
    3  393  423  359  381  387
    4  394  394  377  392  418
    5  396  394  427  414  388
    

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    ## create some data similar to yours
    np.random.seed(42)
    categories = ['q1','q2','q3','q4','q5']
    df = pd.DataFrame(np.random.randint(1,6,size=(2000, 5)), columns=categories)
    
    ## counts will be used for the labels
    counts = df.apply(lambda x: x.value_counts())
    
    ## percents gives the width of each segment: every count is divided by
    ## its question's column total, so each bar spans exactly 1.0
    percents = counts.div(counts.sum(axis=0), axis=1)
    
    ## ax.patches is ordered answer by answer (all questions for answer 1, then answer 2, ...),
    ## which matches a row-by-row walk through the counts array
    counts_array = counts.values
    nrows, ncols = counts_array.shape
    indices = [(i, j) for i in range(nrows) for j in range(ncols)]
    
    ax = percents.T.plot(kind='barh', stacked=True)
    ax.legend(bbox_to_anchor=(1, 1.01), loc='upper right')
    for i, p in enumerate(ax.patches):
        ## the segment width is a fraction, so format it as a percentage; the raw count goes just above it
        ax.annotate(f"({p.get_width():.1%})", (p.get_x() + p.get_width() - 0.15, p.get_y() - 0.10), xytext=(5, 10), textcoords='offset points')
        ax.annotate(str(counts_array[indices[i]]), (p.get_x() + p.get_width() - 0.15, p.get_y() + 0.10), xytext=(5, 10), textcoords='offset points')
    plt.show()
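
Since the question leans toward plain matplotlib, here is a rough sketch of the same idea without DataFrame.plot, applied to the 7-row sample from the question. The helper names count_answers and plot_shares are made up for this illustration, and Axes.bar_label is assumed to be available (it was added in matplotlib 3.4). With so few rows some answers never occur in a column, so the counts are reindexed and filled with 0 before normalizing.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample survey data copied from the question (responder column omitted).
df = pd.DataFrame(
    {
        "q1": [5, 3, 2, 1, 1, 2, 4],
        "q2": [3, 5, 1, 4, 2, 3, 3],
        "q3": [2, 1, 3, 5, 5, 4, 2],
        "q4": [4, 4, 4, 3, 3, 5, 1],
        "q5": [1, 2, 5, 2, 4, 1, 5],
    }
)


def count_answers(frame, answers=(1, 2, 3, 4, 5)):
    """Hypothetical helper: one row per answer, one column per question."""
    counts = frame.apply(lambda col: col.value_counts())
    # value_counts() omits answers that never occur, so reindex and fill with 0.
    return counts.reindex(list(answers)).fillna(0).astype(int)


def plot_shares(frame):
    """Hypothetical helper: one 100%-wide bar per question, labeled with counts."""
    counts = count_answers(frame)
    shares = counts / counts.sum(axis=0)  # divide each column by its total

    questions = list(shares.columns)
    left = np.zeros(len(questions))  # running left edge of the next segment

    fig, ax = plt.subplots()
    for answer in shares.index:
        widths = shares.loc[answer].to_numpy()
        bars = ax.barh(questions, widths, left=left, label=str(answer))
        # Put the raw count in the middle of each segment; skip empty segments.
        labels = [str(n) if n else "" for n in counts.loc[answer]]
        ax.bar_label(bars, labels=labels, label_type="center")
        left = left + widths

    ax.set_xlim(0, 1)
    ax.set_xlabel("share of responses")
    ax.legend(title="answer", bbox_to_anchor=(1.02, 1), loc="upper left")
    return fig, ax
```

Calling fig, ax = plot_shares(df) followed by plt.show() (or fig.savefig(...)) draws the chart. Stacking by hand with the left argument keeps the segment order explicit, which makes it easier to match each label to its count than walking through ax.patches.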
    
