Search code examples
pandasmatplotlibseabornboxplot

How to customise pandas box and whisker plot with additional percentiles?


I am trying to produce the following plot, that was made in excel, by using pandas.

enter image description here

A lot of plotting at at work is done using excel and it is fiddly and tedious to get the data into the required format. I want to use pandas, but my bosses want to see the exact same (or very close) plots being produced.

I normally use seaborn for box plots and find it very convenient but I need to show more percentiles (5th, 10th, 25th, 50th, 75th, 90th, and 95th) as shown on the figure legend.

I know that seaborn/matplotlib lets me change the whiskers extent using whis=[10,90], and I can use showmean=True, but that leaves the other markers (95th and 5th percentiles) to add to each plot. How to overlay those?

I have my data grouped as I want, and I can extract the percentiles using .describe() as below pcntls=assay.groupby(['LocalSTRAT']).describe(percentiles=[0.1,0.05,0.25,0.5,0.75,0.9,0.95])

and transform which gives me this:

LocalSTRAT  AB  CBC     CLB     LAB     LBB     LBL     MS  TB  TBL     UAB     UBB
count   982.000000  234.000000  159.000000  530.000000  1136.000000     72.000000   267.000000  1741.000000     16.000000   1641.000000     2099.000000
mean    0.687658    1.410962    0.118302    0.211321    0.110251    0.077917    0.766124    0.262648    0.191875    0.119174    1.320357
std     0.814027    0.855342    0.148397    0.286574    0.146550    0.088921    0.647259    0.309134    0.125497    0.207197    1.393613
min     0.005000    0.005000    0.020000    0.005000    0.005000    0.010000    0.005000    0.005000    0.060000    0.005000    0.005000
5%  0.030000    0.196500    0.030000    0.020000    0.020000    0.020000    0.060000    0.020000    0.067500    0.005000    0.170000
10%     0.050000    0.363000    0.038000    0.020000    0.020000    0.021000    0.096000    0.030000    0.070000    0.020000    0.230000
25%     0.130000    0.825000    0.045000    0.050000    0.030000    0.030000    0.225000    0.050000    0.077500    0.030000    0.450000
50%     0.400000    1.260000    0.070000    0.120000    0.050000    0.050000    0.610000    0.150000    0.175000    0.060000    0.940000
75%     0.950000    1.947500    0.140000    0.250000    0.120000    0.072500    1.120000    0.350000    0.257500    0.130000    1.570000
90%     1.720000    2.411000    0.262000    0.520000    0.265000    0.149000    1.624000    0.640000    0.340000    0.250000    2.770000
95%     2.370000    2.967500    0.322000    0.685500    0.390000    0.237000    2.037000    0.880000    0.390000    0.410000    4.322000
max     7.040000    5.070000    1.510000    2.620000    1.450000    0.580000    3.530000    2.390000    0.480000    4.190000    11.600000

I am stuck with how to begin to construct boxplots from scratch using this output.

I think it's easier to construct some boxplots in the normal way, and then add the extra couple of data points (5th and 95th percentile markers) over the top but can't figure out how to do this.

(Bonus points for a way to make a legend like the one shown or how to insert an image file of this into my plot, and get the log-style gridlines, and include the counts in the x axis!)


Solution

  • Just overlay a scatter plot using the percentiles extracted from the .describe() output, remembering to sort both to make sure the order doesn't get mixed up. The legend was made as an image externally and inserted separately.

    Counts were calculated and added using plt.text().

    Logarithmic gridlines applied using plt.grid(True, which='both') and setting the axis to log.

    code and result below.

    import pandas as pd
    import seaborn as sns
    import matplotlib
    import matplotlib.pyplot as plt
    
    
    pathx = r"C:\boxplots2.xlsx"
    
    pathx =  pathx.replace( "\\", "/")#avoid escape character issues
    #print pathx
    #pathx = pathx[1:len(pathx)-1]
    df=pd.read_excel(pathx)
    
    #this line removes missing data rows (where the strat is not specified)
    df=df[df["STRAT"]!=0]
    
    
    
    assay=df
    
    factor_to_plot='Total %S'
    f=factor_to_plot
    
    x_axis_factor='STRAT'
    g=x_axis_factor
    
    pcntls=assay.groupby([g]).describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95])
    sumry= pcntls[f].T
    #print sumry
    ordered=sorted(assay[g].dropna().unique())
    
    #set figure size and scale text
    plt.rcParams['figure.figsize']=(15,10)
    text_scaling=1.9
    sns.set(style="whitegrid")
    sns.set_context("paper", font_scale=text_scaling) 
    
    #plot boxplot
    ax=sns.boxplot(x=assay[g],y=assay[f],width=0.5,order=ordered, whis=[10,90],data=assay, showfliers=False,color='lightblue', 
                showmeans=True,meanprops={"marker":"x","markersize":12,"markerfacecolor":"white", "markeredgecolor":"black"})
    
    plt.axhline(0.3, color='green',linestyle='dashed', label="S%=0.3")
    
    #this line sets the scale to logarithmic
    ax.set_yscale('log')
    
    leg= plt.legend(markerscale=1.5,bbox_to_anchor=(1.0, 0.5) )#,bbox_to_anchor=(1.0, 0.5)
    #plt.title("Assay data")
    plt.grid(True, which='both')
    ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['5%'],s=120,color='white',edgecolor='black') 
    ax.scatter(x=sorted(list(sumry.columns.values)),y=sumry.loc['95%'],s=120,color='white',edgecolor='black')
    
    
    #add legend image
    img = plt.imread("legend.jpg")
    plt.figimage(img, 1900,900, zorder=1, alpha=1)
    
    
    #next line is important, select a column that has no blanks or nans as the total items are counted. 
    assay['value']=assay['From']
    
    vals=assay.groupby([g])['value'].count()
    j=vals
    
    ymin, ymax = ax.get_ylim()
    xmin, xmax = ax.get_xlim()
    #print ymax
    
    #put n= values at top of plot    
    x=0
    for i in range(len(j)):
    
        plt.text(x = x , y = ymax+0.2, s = "N=\n" +str(int(j[i])),horizontalalignment='center')
        #plt.text(x = x , y = 102.75, s = "n=",horizontalalignment='center')
        x+=1
    
    
    
    
    
    #use the section below to adjust the y axis lable format to avoid default of 10^0 etc for log scale plots.
    ylabels = ['{:.1f}'.format(y) for y in ax.get_yticks()]
    ax.set_yticklabels(ylabels)
    

    Which gives:

    enter image description here