python, r, seaborn, histogram

Seaborn: How to scale Y axis to 100 percent for each categorical value


Objective:

I want to create a stacked histogram of the PaperlessBilling categorical feature (Telco Customer Churn dataset), with the Y axis shown as a percentage and the Churn distribution as the hue. However, the percentages should be computed within each category (so each bar totals 100%), not across the whole dataset.

Here is what I would expect, using R:

ggplot(Churn, aes(SeniorCitizen, fill = Churn)) +
  geom_bar(position = "fill") +
  xlab("Senior Citizen status") +
  ylab("Percent") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = c("Non-Senior Citizens", "Senior Citizens")) +
  scale_fill_manual(name = "Churn Status", values = c("green2", "red1"), labels = c("No", "Yes")) +
  ggtitle("The Ratio of Churns by Senior Citizen status") +
  theme_classic() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, size = 15))

Here is the output of the above code (note that each category totals 100%):

[Image: output of the R code]

Here is what I've done:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(5, 5))

# df is the Telco Customer Churn DataFrame
sns.histplot(
    df,
    x="PaperlessBilling",
    hue="Churn",
    multiple="stack",
    stat="percent",
    ax=ax,
)

This is the output of the above code:

[Image: output of the seaborn code]


Solution

  • With stat="percent", all bars in the plot together sum to 100. To make the bars at each x-value sum to 100%, use multiple="fill" instead. Note that with multiple="fill" each stack sums to 1.0 rather than 100, so the two axes need different formatters: PercentFormatter(100) for stat="percent" and PercentFormatter(1) for multiple="fill".

    import matplotlib.pyplot as plt
    from matplotlib.ticker import PercentFormatter
    import seaborn as sns
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({"PaperlessBilling": np.random.choice(['Yes', 'No'], p=[.6, .4], size=2000)})
    df["Churn"] = np.where(df["PaperlessBilling"] == 'Yes',
                           np.random.choice(['Yes', 'No'], p=[.3, .7], size=2000),
                           np.random.choice(['Yes', 'No'], p=[.1, .9], size=2000))
    df["PaperlessBilling"] = pd.Categorical(df["PaperlessBilling"], ['Yes', 'No'])  # fix an order
    df["Churn"] = pd.Categorical(df["Churn"], ['No', 'Yes'])  # fix an order
    
    palette = {'Yes': 'crimson', 'No': 'limegreen'}
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
    
    sns.histplot(df, x="PaperlessBilling", hue="Churn", palette=palette, alpha=1,
                 multiple="stack", stat="percent", ax=ax1)
    ax1.yaxis.set_major_formatter(PercentFormatter(100))
    
    sns.histplot(df, x="PaperlessBilling", hue="Churn", palette=palette, alpha=1,
                 multiple="fill", ax=ax2)
    ax2.yaxis.set_major_formatter(PercentFormatter(1))
    sns.despine()
    plt.tight_layout()
    plt.show()
    

    [Image: sns.histplot stat="percent" (left) vs multiple="fill" (right)]
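To check the numbers behind the two plots, the same two normalizations can be computed directly with pandas. This is only a sketch for verification, not part of the original answer: it rebuilds a synthetic DataFrame with the same column names (an assumption standing in for the real Telco data) and uses pd.crosstab with normalize="all" and normalize="index".

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Telco data (assumption: same column names as above).
rng = np.random.default_rng(0)
n = 2000
billing = rng.choice(["Yes", "No"], p=[0.6, 0.4], size=n)
churn = np.where(billing == "Yes",
                 rng.choice(["Yes", "No"], p=[0.3, 0.7], size=n),
                 rng.choice(["Yes", "No"], p=[0.1, 0.9], size=n))
df = pd.DataFrame({"PaperlessBilling": billing, "Churn": churn})

# Each cell as a percentage of ALL rows: the heights drawn by stat="percent".
overall = pd.crosstab(df["PaperlessBilling"], df["Churn"], normalize="all") * 100

# Each cell as a fraction of its PaperlessBilling group: the heights drawn by multiple="fill".
within = pd.crosstab(df["PaperlessBilling"], df["Churn"], normalize="index")

print(overall.round(1))
print(within.round(3))
```

Each row of `within` sums to 1.0, which is why the second axis uses PercentFormatter(1). If a seaborn-free plot is acceptable, `within.plot(kind="bar", stacked=True)` should give a comparable 100% stacked bar chart.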