Tags: python, matplotlib, seaborn, bar-chart

Sort Seaborn Histogram by Count in a Binary Variable


I am working with pandas and seaborn to generate a fairly large barplot. The x axis consists of multiple identifying numbers, while the y axis displays counts. I am trying to order the x axis so that the bars are sorted in descending order by the counts of a binary variable (which is shown with seaborn barplot's hue feature).

Currently, I have tried to pass a sorted dataframe into the data parameter of barplot, as such:

count= df['binary_variable'].value_counts()
df['count'] = df['binary_variable'].map(count)
df.sort_values(by = 'count', ascending = False, inplace = True)

sns.barplot(data = df, x = 'ID', hue = 'binary_variable')

However, this seems to have no effect on the barplot. Inspecting the sorted dataframe with .head() shows that the sorting itself works; it just does not change the order of the bars.

I have also tried to change the x axis into type str with df['unique_id'] = df['unique_id'].astype(str), which seemed to work in the case of a question posed here: Seaborn catplot Sort by Count column. However, this method does not work in this case.

Here is some dummy data that may help with this case:

ID | binary_variable
1  | 1
1  | 1
1  | 0
2  | 1
2  | 0
3  | 1
4  | 1
4  | 1
4  | 1
4  | 1

In this dummy case, I would want the bar with ID 4 to be leftmost as it has the highest count of 1 in the binary_variable column.


Solution

  • From the comments, I understand you want to have a plot with counts of a "unique id" and a "binary variable", ordered by the count of the "unique id" where "binary variable" equals 1.

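    Seaborn's histplot doesn't have an order= parameter, but sns.barplot does. (countplot also accepts order=, but it counts the raw rows itself; the approach here precomputes the counts and plots them with barplot.)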

    The code below has some comments trying to clarify the steps.

    df.loc[df['binary_variable'] == 1, ['unique_id', 'binary_variable']] selects the rows where 'binary_variable' == 1 and keeps the columns 'unique_id' and 'binary_variable' (in that order). .value_counts() then counts each combination in this selection, sorted from the highest count to the lowest. .reset_index() converts the index of the resulting counts back to regular columns.

    import seaborn as sns
    import pandas as pd
    import numpy as np
    
    # create some dummy test data
    df = pd.DataFrame({'binary_variable': np.random.randint(0,2,200),
                       'unique_id': np.random.randint(1,6,200) })
    
    # count where binary_variable == 1, and sort by count (descending)
    # name='count' keeps the column name stable across pandas versions
    count1 = (df.loc[df['binary_variable'] == 1, ['unique_id', 'binary_variable']]
                .value_counts(sort=True, ascending=False)
                .reset_index(name='count'))
    # count where binary_variable == 0
    count0 = (df.loc[df['binary_variable'] == 0, ['unique_id', 'binary_variable']]
                .value_counts()
                .reset_index(name='count'))
    # concatenate both count dataframes
    both_counts = pd.concat([count1, count0], ignore_index=True)
    
    # create a barplot of the counts, the order is taken from count1
    # the hue order is changed to have the 1's first
    sns.set_style('whitegrid')
    sns.barplot(data=both_counts, x='unique_id', order=count1['unique_id'],  y='count',
                hue='binary_variable', hue_order=[1, 0], palette='summer')
    sns.despine()
    

    [Figure: seaborn count plot ordered by first hue value]
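
As a quick check against the dummy data from the question, the bar order can be worked out with pandas alone. This is only a minimal sketch: it uses a groupby sum instead of the value_counts() approach above (equivalent here because binary_variable holds only 0 and 1), and the names ones_per_id and order are made up for illustration.

```python
import pandas as pd

# The dummy data from the question.
df = pd.DataFrame({
    "ID":              [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "binary_variable": [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Because binary_variable is 0/1, summing it per ID gives the number of 1s.
# Unlike filtering on == 1 first, this also keeps IDs that have no 1s at all.
ones_per_id = df.groupby("ID")["binary_variable"].sum()

# IDs from the highest count of 1s to the lowest.
# IDs 2 and 3 are tied, so their relative order is not guaranteed.
order = ones_per_id.sort_values(ascending=False).index.tolist()

print(ones_per_id.to_dict())
print(order)
```

ID 4 comes first, as the question asks, followed by ID 1. A list like this could be passed as order=order to sns.barplot in the same way count1['unique_id'] is used above.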