I'm struggling to create a stacked bar chart derived from value_counts() of the columns of a dataframe.
Assume a dataframe like the following, where responder is not important, but I would like to stack the counts of [1,2,3,4,5] for all q# columns.
responder, q1, q2, q3, q4, q5
------------------------------
r1, 5, 3, 2, 4, 1
r2, 3, 5, 1, 4, 2
r3, 2, 1, 3, 4, 5
r4, 1, 4, 5, 3, 2
r5, 1, 2, 5, 3, 4
r6, 2, 3, 4, 5, 1
r7, 4, 3, 2, 1, 5
It should look something like this, except each bar would be labeled by q# and would include 5 sections for the counts of [1,2,3,4,5] from the data:
Ideally, all bars will be "100%" wide, showing each count as a proportion of the bar. But it's guaranteed that each responder row will have one entry for each question, so the percentage is just a bonus if possible.
Any help would be much appreciated, with a slight preference for a matplotlib solution.
You can calculate the widths of the bar segments as proportions and obtain the stacked bar plot using ax = percents.T.plot(kind='barh', stacked=True), where percents is a DataFrame with q1,...,q5 as columns and 1,...,5 as indices.
>>> percents
         q1        q2        q3        q4        q5
1  0.196873  0.199316  0.206644  0.194919  0.202247
2  0.205357  0.188988  0.205357  0.205357  0.194940
3  0.202265  0.217705  0.184766  0.196089  0.199177
4  0.199494  0.199494  0.190886  0.198481  0.211646
5  0.196137  0.195146  0.211491  0.205052  0.192174
Then you can use ax.patches to add labels to every bar segment. The labels can be generated from the original counts DataFrame: counts = df.apply(lambda x: x.value_counts())
>>> counts
    q1   q2   q3   q4   q5
1  403  408  423  399  414
2  414  381  414  414  393
3  393  423  359  381  387
4  394  394  377  392  418
5  396  394  427  414  388
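As a quick check of the counting step, here is a minimal sketch that applies the same value_counts() call to the seven sample rows from the question. The reindex/fillna step is my own addition, not part of the answer above: with a small sample some answer values never occur in a column, so the raw result contains NaN there.

```python
import pandas as pd

# the seven sample rows from the question (responder column left out)
df = pd.DataFrame({
    "q1": [5, 3, 2, 1, 1, 2, 4],
    "q2": [3, 5, 1, 4, 2, 3, 3],
    "q3": [2, 1, 3, 5, 5, 4, 2],
    "q4": [4, 4, 4, 3, 3, 5, 1],
    "q5": [1, 2, 5, 2, 4, 1, 5],
})

# one value_counts() per column, aligned on the answer values
raw = df.apply(lambda x: x.value_counts())

# q4 never contains a 2 and q5 never contains a 3, so those cells are NaN;
# force the index to 1..5 and replace the gaps with zero counts
counts = raw.reindex(range(1, 6)).fillna(0).astype(int)
print(counts)
```

Every column of counts should add up to the number of responders, which is an easy sanity check before plotting.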
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
## create some data similar to yours
np.random.seed(42)
categories = ['q1','q2','q3','q4','q5']
df = pd.DataFrame(np.random.randint(1,6,size=(2000, 5)), columns=categories)
## counts will be used for the labels
counts = df.apply(lambda x: x.value_counts())
## percents will be used to determine the width of each bar segment
## (each row of counts is divided by that row's total across q1..q5)
percents = counts.div(counts.sum(axis=1), axis=0)
ax = percents.T.plot(kind='barh', stacked=True)
ax.legend(bbox_to_anchor=(1, 1.01), loc='upper right')
## ax.patches is ordered by answer value, then by question, which is the
## row-major order of the counts table
for p, count in zip(ax.patches, counts.values.ravel()):
    x = p.get_x() + p.get_width() - 0.15
    ## get_width() is a fraction, so format it as a percentage
    ax.annotate(f"({p.get_width():.1%})", (x, p.get_y() - 0.10), xytext=(5, 10), textcoords='offset points')
    ax.annotate(str(count), (x, p.get_y() + 0.10), xytext=(5, 10), textcoords='offset points')
plt.show()
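One thing to watch: the percents table shown above is normalised across each answer value (its rows sum to 1), so a stacked bar for a single question need not total exactly 100%. If every bar has to be exactly "100%" wide, a possible variant is to divide each column by its own total instead. The sketch below does that on the sample rows from the question. It is an illustration under my own assumptions rather than the method above: the name shares is made up, and it labels the segments with Axes.bar_label, which needs matplotlib 3.4 or later, instead of ax.annotate.

```python
import pandas as pd
import matplotlib.pyplot as plt

# the seven sample rows from the question (responder column left out)
df = pd.DataFrame({
    "q1": [5, 3, 2, 1, 1, 2, 4],
    "q2": [3, 5, 1, 4, 2, 3, 3],
    "q3": [2, 1, 3, 5, 5, 4, 2],
    "q4": [4, 4, 4, 3, 3, 5, 1],
    "q5": [1, 2, 5, 2, 4, 1, 5],
})

# counts per answer value, with zero where a value never occurs
counts = (df.apply(lambda x: x.value_counts())
            .reindex(range(1, 6)).fillna(0).astype(int))

# divide every column (question) by its own total so each bar sums to 1
shares = counts.div(counts.sum(axis=0), axis=1)

ax = shares.T.plot(kind="barh", stacked=True)
ax.set_xlim(0, 1)
ax.legend(title="answer", bbox_to_anchor=(1.0, 1.0), loc="upper left")

# pandas creates one bar container per answer value, each holding one
# segment per question; label segments with the raw counts, skipping zeros
for container, value in zip(ax.containers, counts.index):
    labels = [str(n) if n else "" for n in counts.loc[value]]
    ax.bar_label(container, labels=labels, label_type="center")

ax.figure.tight_layout()
# call plt.show() here when running interactively
```

Because the question says every responder answers every question, all column totals are equal there and the raw counts could be stacked directly; the division only matters for the proportional view.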