
How to avoid plotting a box when the number of observations is too low with seaborn boxplot?


I'm using seaborn's boxplot to plot a series of boxes showing some errors as a function of a variable that ranges from 0 to 30. When that variable is low (0, 1 or 2), I have plenty of observations for each box, but for values between 17 and 29 I have fewer than 4 observations, so I don't think plotting a box is appropriate. This is most noticeable in the 29th box, which has only one observation and is therefore absurdly thin. In cases where there are only two observations, drawing whiskers is simply nonsensical.

I would like seaborn to plot the observations as fliers in these cases. That is, if the number of observations in a group is below some threshold, plot them as fliers instead of a box.

I'm only going to include a fraction of the code to make this readable, since the code for the plot I show below is quite long.

sns.boxplot(data=data, x='days', y='error', ax=ax[1, 1], color='powderblue', width=0.7)

Where data would look something like this:

error days
-1 2
1 0
1 0
8 2
-1 2
-15 24
-1 19
-21 24
1 2
-2 1
8 0
-15 1
2 0
-8 1
3 0

So I would like a box for days 0, 1 and 2, which have more than two observations, while days 19 and 24, which have only 1 and 2 observations respectively, would be plotted as fliers.
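
For reference, the sample above can be loaded into a DataFrame to check how many observations each day has. This is only a minimal sketch: `min_obs` is a hypothetical name for the threshold, and the value 3 is an assumption taken from the "more than two observations" wording.

```python
import pandas as pd

# Sample rows copied from the table in the question.
data = pd.DataFrame({
    "error": [-1, 1, 1, 8, -1, -15, -1, -21, 1, -2, 8, -15, 2, -8, 3],
    "days":  [2, 0, 0, 2, 2, 24, 19, 24, 2, 1, 0, 1, 0, 1, 0],
})

# Number of observations available for each value of `days`.
counts = data.groupby("days").size()
print(counts)

# Days with too few observations for a meaningful box.
# `min_obs` is a hypothetical cut-off, not something seaborn defines.
min_obs = 3
sparse_days = counts[counts < min_obs].index.tolist()
print(sparse_days)
```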

In my current plot, the bar plot at the top shows the number of observations in each bin, as discussed earlier.

I could plot them separately, removing the problematic observations from data and drawing them with a scatter plot, but that would add a lot of code, making it longer to write and harder to read. So I'm asking in case there is a simple and elegant solution that I'm unaware of.


Solution

  • Masking the valid/invalid data seems the way to go and is quite easy to do.

    You could write a custom function to combine a boxplot and a stripplot:

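    import seaborn as sns

    def semi_boxplot(data, x, y, thresh=3, **kwargs):
        # True for rows whose x-group has at least `thresh` observations
        m = data.groupby(x).transform('size').ge(thresh)
        # boxes only for the large groups (the other rows become NaN)
        ax = sns.boxplot(x=data[x], y=data[y].where(m), **kwargs)
        # small groups drawn as hollow, flier-like markers on the same axes
        sns.stripplot(x=data[x], y=data[y].mask(m), ax=ax,
                      edgecolor='grey', color='none', linewidth=1, size=6)
        return ax

    semi_boxplot(data, x='days', y='error', zorder=2, color='powderblue', width=0.7)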
    


    With thresh=4:

    semi_boxplot(data, x='days', y='error', thresh=4,
                 zorder=2, color='powderblue', width=0.7)
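
To see what the mask inside semi_boxplot does without drawing anything, the same split can be reproduced on the sample rows from the question. This is only a sketch: the names `box_values`, `strip_values` and `split` are made up for illustration, and the group size is computed on the `error` column, which should be equivalent for this data.

```python
import pandas as pd

# Sample rows copied from the table in the question.
data = pd.DataFrame({
    "error": [-1, 1, 1, 8, -1, -15, -1, -21, 1, -2, 8, -15, 2, -8, 3],
    "days":  [2, 0, 0, 2, 2, 24, 19, 24, 2, 1, 0, 1, 0, 1, 0],
})

thresh = 3

# True for rows whose `days` group has at least `thresh` observations.
m = data.groupby("days")["error"].transform("size").ge(thresh)

# Values that would go to the boxplot: NaN where the group is too small.
box_values = data["error"].where(m)

# Values that would go to the stripplot: NaN where the group is large enough.
strip_values = data["error"].mask(m)

# Side-by-side view of the two complementary series.
split = pd.DataFrame({"days": data["days"],
                      "box": box_values,
                      "strip": strip_values})
print(split)
```

Every row ends up in exactly one of the two series, so the boxplot and the stripplot never draw the same observation twice. Raising thresh to 4 in this sketch would also move day 1, which has three rows here, to the marker side.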