Tags: python, pandas, ggplot2, visualization, plotnine

plotnine: How to hide or not plot labels of small counts


I have a plot (image not shown) which was plotted from

import pandas as pd
from plotnine import ggplot, aes, after_stat, geom_bar, geom_label


def combine(counts: pd.Series, percentages: pd.Series):
    fmt = "{} ({}%)".format
    return [
        fmt(c, round(p))
        for c, p
        in zip(counts, percentages, strict=True)
    ]

d = {
    'cat': [*(2200 * ['cat1']), *(180 * ['cat2']), *(490 * ['cat3'])],
    'subcat': [
        *(2200 * ['subcat1']),
        *(150 * ['subcat2']),
        *(30 * ['subcat3']),
        *(40 * ['subcat4']),
        *(450 * ['subcat5'])
    ]
}
df = pd.DataFrame(d)

cats = (
    ggplot(df, aes('cat', fill='subcat'))
    + geom_bar()
    + geom_label(
          aes(label=after_stat('combine(count, count / sum(count) * 100)')),
          stat='count',
          position='stack'
      )
)
cats.save('cats.png')

The combine function was modified from the original in "Show counts and percentages for bar plots."

The label for subcat4 is partially covered by the one for subcat5, making its count and percentage unreadable.

How can a label be hidden or, better yet, simply not plotted if its count is too small?

I tried

    ...
        fmt(c, round(p)) if p > 5 else (None, None)
        ...

but that just makes the labels with percentages lower than or equal to 5% say “(None, None).”

Using position='fill' for both geom_bar and geom_label is not really a solution either because the problem persists for sufficiently small counts (e.g., if the count for subcat4 is 10). And I also want to preserve proportionality of subcategories across all categories, which is lost with position='fill'.

The end goal is simply to keep labels from overlapping, so approaches other than hiding them are acceptable too. (I thought of "dodging" labels vertically along the y-axis, but I don't think that's possible.)


Solution

  • You can modify the combine function so that it returns an empty string '' instead of (None, None) whenever the percentage is 5% or lower:

    def combine(counts: pd.Series, percentages: pd.Series):
        fmt = "{} ({}%)".format
        return [
            fmt(c, round(p)) if p > 5 else ''
            for c, p
            in zip(counts, percentages, strict=True)
        ]