python pandas group-by bar-chart stacked

How to plot a stacked bar chart using pandas python

I have 3 dataframes for yearly data (one for 2014, 2015 and 2016), each having 3 columns named, 'PRACTICE', 'BNF NAME', 'ITEMS'.

BNF NAME refers to drug names and I am picking out 3 Ampicillin, Amoxicillin and Co-Amoxiclav. This column has different strengths/dosages (e.g Co-Amoxiclav 200mg or Co-Amoxiclav 300mg etc etc) that I want to ignore, so I have used str.contains() to select these 3 drugs. ITEMS is the total number of prescriptions written for each drug.

I want to create a stacked bar chart with the x axis being year (2014, 2014, 2015) and the y axis being total number of prescriptions, and each of the 3 bars to be split up into 3 for each drug name.

I am assuming I need to use df.groupby() and select a partial string maybe, however I am unsure how to combine the yearly data and then how to group the data to create the stacked bar chart.

Any guidance would be much appreciated.

This is the line of code I am using to select the rows for the 3 drug names only.

frame=frame[frame['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]

This is what each of the dataframes resembles:

PRACTICE |  BNF NAME |  ITEMS
Y00327 | Co-Amoxiclav_Tab 250mg/125mg | 23
Y00327 | Co-Amoxiclav_Susp 125mg/31mg/5ml S/F | 10
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml S/F | 6
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml | 1
Y00327 | Co-Amoxiclav_Tab 500mg/125mg | 50

Solution

There are likely going to be a few different ways in which you could accomplish this. Here's how I would do it. I'm using a jupyter notebook, so your matplotlib imports may be different.

import pandas as pd
%matplotlib
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

df = pd.DataFrame({'PRACTICE': ['Y00327', 'Y00327', 'Y00327', 'Y00327', 'Y00327'],
                   'BNF NAME': ['Co-Amoxiclav_Tab 250mg/125mg', 'Co-Amoxiclav_Susp 125mg/31mg/5ml S/F',
                                'Co-Amoxiclav_Susp 250mg/62mg/5ml S/F', 'Ampicillin 250mg/62mg/5ml',
                               'Amoxicillin_Tab 500mg/125mg'],
                   'ITEMS': [23, 10, 6, 1, 50]})



Out[52]:
BNF NAME    ITEMS   PRACTICE
0   Co-Amoxiclav_Tab 250mg/125mg    23  Y00327
1   Co-Amoxiclav_Susp 125mg/31mg/5ml S/F    10  Y00327
2   Co-Amoxiclav_Susp 250mg/62mg/5ml S/F    6   Y00327
3   Ampicillin 250mg/62mg/5ml   1   Y00327
4   Amoxicillin_Tab 500mg/125mg 50  Y00327

To simulate your three dataframes:

df1 = df.copy()
df2 = df.copy()
df3 = df.copy()

Set a column indicating what year the dataframe represents.

df1['YEAR'] = 2014
df2['YEAR'] = 2015
df3['YEAR'] = 2016

Combining the three dataframes:

combined_df = pd.concat([df1, df2, df3], ignore_index=True)

To set what drug each row represents:

combined_df['parsed_drug_name'] = "" # creates a blank column
amp_bool = combined_df['BNF NAME'].str.contains('Ampicillin', case=False)
combined_df.loc[amp_bool, 'parsed_drug_name'] = 'Ampicillin' # sets the row to amplicillin, if BNF NAME contains 'ampicillin.'

amox_bool = combined_df['BNF NAME'].str.contains('Amoxicillin', case=False)
combined_df.loc[amox_bool, 'parsed_drug_name'] = 'Amoxicillin'

co_amox_bool = combined_df['BNF NAME'].str.contains('Co-Amoxiclav', case=False)
combined_df.loc[co_amox_bool, 'parsed_drug_name'] = 'Co-Amoxiclav'

Finally, perform a pivot on the data, and plot the results:

combined_df.pivot_table(index='YEAR', columns='parsed_drug_name', values='ITEMS', aggfunc='sum').plot.bar(rot=0, stacked=True)