How to connect boxplot median values


It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.

I'm using the syntax below so that pandas automatically generates the box plot for Y vs. X device, without any external manipulation of the data frame:

df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)

One approach I thought of is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.


Solution

  • You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.

    First let's generate some sample data:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    
    N = 150
    values = np.random.random(size=N)
    groups = np.random.choice(['A','B','C'], size=N)
    df = pd.DataFrame({'value':values, 'group':groups})
    
    print(df.head())
    # Example output (values differ on each run, since no seed is set):
    #   group     value
    # 0     A  0.816847
    # 1     A  0.468465
    # 2     C  0.871975
    # 3     B  0.933708
    # 4     A  0.480170
    #               ...
    

    Next, make the boxplot and save the axis object:

    ax = df.boxplot(column='value', by='group', showfliers=True, 
                    positions=range(df.group.unique().shape[0]))
    

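    Note: The positions argument of Pyplot/Pandas boxplot() matters here. By default the boxes are drawn at x = 1, 2, ..., N, while Seaborn's categorical plots place categories at x = 0, 1, ..., N-1, so an overlay ends up off by one unless the two are aligned. See more in this discussion, including the workaround I've employed here (passing positions starting at 0).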

    Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:

    sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
    

    Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
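
If you'd rather not pull in Seaborn, here is a minimal sketch of the same idea using only pandas and Matplotlib, this time connecting medians. It assumes the same 'value'/'group' sample columns as above. The seed, the red line styling, and the Agg backend line are my own choices for illustration, not part of the original answer. Instead of shifting the boxes with positions, it leaves them at their default 1-based x positions and draws the line at those same positions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line in a notebook

import numpy as np
import pandas as pd

# Sample data shaped like the frame above (seeded so runs are repeatable)
rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    "value": rng.random(N),
    "group": rng.choice(["A", "B", "C"], size=N),
})

# Without `positions`, the boxes sit at x = 1, 2, ..., n_groups
ax = df.boxplot(column="value", by="group", showfliers=True)

# One median per category; groupby sorts the labels, matching the box order
medians = df.groupby("group")["value"].median()

# Draw the connecting line at the same 1-based x positions as the boxes
x_positions = np.arange(1, len(medians) + 1)
ax.plot(x_positions, medians.to_numpy(), marker="o", color="tab:red")

# To view the result: ax.figure.savefig("boxplot_medians.png")
```

Swap median() for mean() to connect the means instead. Either way, the aggregation comes from the data frame itself, so there is no need to read the values back out of the boxplot artists.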