Tags: python, plotnine, geom-histogram

plotnine geom_histogram wrong bin placement


I'm trying to define the bins of my histogram precisely, so that each bin has a width of exactly 10.

Here is an example. I defined a list of numbers. The list contains 10 numbers with 1 digit, and then 50 numbers between 50 and 59, 60 numbers between 60 and 69, and so on.

rand_numbers = ([0]*5   + [9]*5) + \
               ([50]*20 + [59]*30) + \
               ([60]*30 + [69]*30) + \
               ([70]*35 + [79]*35) + \
               ([80]*40 + [89]*40) + \
               ([90]*45 + [99]*45)

Then I create a data frame where I "classified" the numbers so that numbers up to 69 get one color, numbers in the 70s get another color, and all numbers from 80 upward get a third color:

import pandas as pd

df = pd.DataFrame({
    'c1': rand_numbers,
    'c2': ['foo'] * 120 + ['bar'] * 70 + ['baz']*170
})

To make the histogram, I'm doing:

import plotnine as p9

p = p9.ggplot(df, p9.aes(x='c1', fill = 'c2')) + \
    p9.scale_x_continuous(breaks=range(0, 120, 10)) +\
    p9.geom_histogram(size=0.5, colour='black', breaks=range(0, 120, 10))

[Image: the resulting histogram]

As you can see, the bins are "spilling" onto one another. Here is more or less what I expected:

[Image: expected histogram, with exactly 10 in the first bin, exactly 50 in the next, then exactly 60, then exactly 70, and so on]

That is, I expected a histogram with exactly 10 elements in the first bin, exactly 50 elements in the next bin (between 50 and 59), then exactly 60 elements in the next one. All of the aforementioned bins should be completely blue. Then, a red bin with exactly 70 elements, and then two green bins with exactly 80 and 90 elements.

As you can see, I'm using the solution suggested here and here on how to predefine the bins in geom_histogram(), but it didn't work the way I expected.

EDIT: While trying to solve this, I noticed that the following "works". Still, I'm not sure whether it is a trustworthy solution.

p9.geom_histogram(size=0.5, colour='black',
     breaks=range(-1, 120, 10))   # <------ here, starting at -1

Solution

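  • By default, geom_histogram (through stat_bin) uses closed="right" (see here), i.e. each bin includes its right edge and excludes its left edge. With breaks at 0, 10, 20, ..., a value that sits exactly on a break, such as 50, is therefore counted in the bin that ends at 50 rather than in the bin that starts at 50, which is why the bars "spill" into their neighbours. It is also why starting the breaks at -1 appears to work for this integer data: the bin (49, 59] then contains both 50 and 59. To get the desired result without shifting the breaks, set closed="left":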

    import plotnine as p9
    import pandas as pd
    
    rand_numbers = ([0]*5   + [9]*5) + \
                   ([50]*20 + [59]*30) + \
                   ([60]*30 + [69]*30) + \
                   ([70]*35 + [79]*35) + \
                   ([80]*40 + [89]*40) + \
                   ([90]*45 + [99]*45)
    
    df = pd.DataFrame({
        'c1': rand_numbers,
        'c2': ['foo'] * 120 + ['bar'] * 70 + ['baz']*170
    })
    
    p9.ggplot(df, p9.aes(x='c1', fill = 'c2')) + \
        p9.scale_x_continuous(breaks=range(0, 120, 10)) +\
        p9.geom_histogram(
            size=0.5, colour='black', 
            breaks=range(0, 120, 10), closed = "left"
        )
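To check the bin counts without rendering a plot, here is a minimal sketch that reproduces the three binning variants using `pd.cut`. It is only a stand-in for the binning that stat_bin performs, not plotnine's own code; the helper name `bin_counts` is hypothetical, and `include_lowest=True` is an assumption made so that the value 0 on the lowest break is not dropped.

```python
import numpy as np
import pandas as pd

rand_numbers = ([0]*5   + [9]*5) + \
               ([50]*20 + [59]*30) + \
               ([60]*30 + [69]*30) + \
               ([70]*35 + [79]*35) + \
               ([80]*40 + [89]*40) + \
               ([90]*45 + [99]*45)


def bin_counts(values, breaks, right):
    """Hypothetical helper: count values per bin.

    right=True  -> bins include their right edge, like closed="right"
    right=False -> bins include their left edge, like closed="left"
    """
    # labels=False returns the integer index of the bin for each value.
    codes = pd.cut(values, bins=list(breaks), right=right,
                   include_lowest=True, labels=False)
    # Count occurrences of each bin index, keeping empty bins as zeros.
    return np.bincount(codes, minlength=len(breaks) - 1).tolist()


breaks = range(0, 120, 10)

# Right-closed bins: values on a break fall into the bin to their left.
right_closed = bin_counts(rand_numbers, breaks, right=True)
print(right_closed)   # [10, 0, 0, 0, 20, 60, 65, 75, 85, 45, 0]

# Left-closed bins: each decade is counted as one bin.
left_closed = bin_counts(rand_numbers, breaks, right=False)
print(left_closed)    # [10, 0, 0, 0, 0, 50, 60, 70, 80, 90, 0]

# Right-closed bins shifted to start at -1 (the workaround from the question).
shifted = bin_counts(rand_numbers, range(-1, 120, 10), right=True)
print(shifted)        # [10, 0, 0, 0, 0, 50, 60, 70, 80, 90, 0, 0]
```

The shifted breaks give the expected counts only because every value here is an integer. A value such as 9.5 would fall into the (9, 19] bin, so closed="left" with breaks on the decades is the more robust choice.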
    
