I am working with pandas and seaborn to generate a fairly large barplot. The x axis consists of multiple identifying numbers, while the y axis displays counts. I am trying to sort the bars along the x axis in descending order of the counts of a binary variable (which is shown using seaborn barplot's hue feature).
Currently, I have tried to pass a sorted dataframe into the data parameter of barplot, as such:
count = df['binary_variable'].value_counts()
df['count'] = df['binary_variable'].map(count)
df.sort_values(by='count', ascending=False, inplace=True)
sns.barplot(data=df, x='ID', hue='binary_variable')
However, this has no effect on the barplot. Inspecting the sorted dataframe with .head() shows that the sorting itself works, but the order of the bars in the plot does not change.
I have also tried converting the x-axis column to type str with df['unique_id'] = df['unique_id'].astype(str), which seemed to work for the question posed here: Seaborn catplot Sort by Count column. However, this method does not work in this case.
Here is some dummy data that may help with this case:
ID | binary_variable
1  | 1
1  | 1
1  | 0
2  | 1
2  | 0
3  | 1
4  | 1
4  | 1
4  | 1
4  | 1
In this dummy case, I would want the bar with ID 4 to be leftmost, as it has the highest count of 1 in the binary_variable column.
From the comments, I understand that you want a plot with counts per "unique id" and "binary variable", ordered by the count of each "unique id" where "binary variable" equals 1.
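Seaborn's histplot doesn't have an order= parameter, but sns.barplot does. (countplot accepts order= as well, but it counts the rows itself, whereas here the counts are computed explicitly so they can be sorted first.)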
The code below has some comments trying to clarify the steps. df.loc[df['binary_variable'] == 1, ['unique_id', 'binary_variable']] makes a subset with 'binary_variable' == 1 and selects the columns 'unique_id' and 'binary_variable' (in that order). On this selection, the values are counted with .value_counts(). .reset_index() converts the index of the resulting counts back to regular columns.
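To make the counting step concrete, here is a minimal sketch of it applied to the dummy data from the question. It assumes the ID column is renamed to unique_id to match the code in this answer, and it passes name='count' to reset_index() so the count column has a predictable name; the intermediate variable names ones and counts are only for illustration.

```python
import pandas as pd

# dummy data from the question; the 'ID' column is named 'unique_id' here (assumption)
df = pd.DataFrame({
    'unique_id':       [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'binary_variable': [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
})

# keep only the rows where binary_variable == 1, with the two columns in a fixed order
ones = df.loc[df['binary_variable'] == 1, ['unique_id', 'binary_variable']]

# count each (unique_id, binary_variable) pair; by default the result is sorted
# by count in descending order and indexed by the pair
counts = ones.value_counts()

# move the index levels back into regular columns and name the count column
count1 = counts.reset_index(name='count')

print(count1)
```

The first row of count1 is the one for unique_id 4 (four rows with a 1), followed by unique_id 1 (two rows), so count1['unique_id'] can be passed as order= to put ID 4 leftmost.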
import seaborn as sns
import pandas as pd
import numpy as np
# create some dummy test data
df = pd.DataFrame({'binary_variable': np.random.randint(0, 2, 200),
                   'unique_id': np.random.randint(1, 6, 200)})
# count where binary_variable == 1, and sort by count (descending)
# name='count' sets the count column's name explicitly
# (pandas versions before 2.0 do not name it 'count' by default)
count1 = df.loc[df['binary_variable'] == 1, ['unique_id', 'binary_variable']].value_counts(sort=True, ascending=False).reset_index(name='count')
# count where binary_variable == 0
count0 = df.loc[df['binary_variable'] == 0, ['unique_id', 'binary_variable']].value_counts().reset_index(name='count')
# concatenate both count dataframes
both_counts = pd.concat([count1, count0], ignore_index=True)
# create a barplot of the counts, the order is taken from count1
# the hue order is changed to have the 1's first
sns.set_style('whitegrid')
sns.barplot(data=both_counts, x='unique_id', order=count1['unique_id'], y='count',
            hue='binary_variable', hue_order=[1, 0], palette='summer')
sns.despine()
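If the raw rows are to be plotted directly instead of pre-aggregated counts, one possible alternative is to compute only the bar order and hand it to sns.countplot via order=. The following is a minimal sketch on the question's dummy data, not a replacement for the code above. It assumes the column is named ID as in the question, and it relies on binary_variable containing only 0 and 1, so that the per-ID sum equals the number of 1s. The names ones_per_id and order are only for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; this line can be dropped in a notebook
import pandas as pd
import seaborn as sns

# dummy data from the question
df = pd.DataFrame({
    'ID':              [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'binary_variable': [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
})

# binary_variable is 0/1, so summing it per ID gives the number of 1s for that ID;
# IDs without any 1 get a sum of 0 and are still kept in the order
ones_per_id = df.groupby('ID')['binary_variable'].sum()
order = ones_per_id.sort_values(ascending=False).index.tolist()

# countplot counts the rows itself; order= fixes the left-to-right order of the bars
ax = sns.countplot(data=df, x='ID', hue='binary_variable',
                   order=order, hue_order=[1, 0])
sns.despine()
```

Here order starts with 4 and then 1; IDs 2 and 3 are tied with one 1 each, so their relative position is not something to rely on.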