My goal is to summarize data by size within years of my dataset. I am able to do these two tasks separately (e.g. summarize by year, or summarize by bin) but am having syntax trouble combining the two.
Below is how I can summarize my data by year:
size_summary = df_raw.groupby(['Year'])['Quantity'].describe()
Below is how I create my bins
mult = 1
bins = [5*mult, 10*mult, 25*mult, 50*mult, 100*mult]
groups = df_raw.groupby(pd.cut(df_raw['Quantity'], bins))
When I try to combine the two in the below, I get an error message. Does anyone know how to combine this to reach my goal? Thank you for your help.
groups.groupby(['Year'])['Quantity'].describe()
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method
EDIT: Adding sample data as requested, below.
df_raw = pd.DataFrame(data={
'Year': [2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014],
'Quantity': [2.0, 3.0, 78.8, 65.7, 70.0, 61.9, 83.9, 39.7, 44.1, 14.5, 35.3, 82.2, 13.9, 66.6, 65.8, 94.8, 50.8, 17.1, 9.9, 51.1, 62.9, 63.0, 13.5, 37.6, 1.5, 70.7, 23.3, 28.1, 21.9, 60.7, 1.1, 67.2, 0.4, 81.4, 86.7, 36.2, 45.2, 50.4, 43.3]
})
Desired output is in the format below - apologies for the screenshot.
As an alternative to pivot_table
, you can group by the bins and year, and then reshape your data via unstack
:
# first group by bins, then by year
groups = df_raw.groupby([pd.cut(df_raw['Quantity'], bins), 'Year'])
# compute group size, pivot into the shape you want
counts = groups.size().unstack(fill_value=0)
counts
Year 2012 2013 2014
Quantity
(5, 10] 0 1 0
(10, 25] 2 3 1
(25, 50] 3 2 3
(50, 100] 7 7 5
This is approximately 2.5x faster than pivot_table
on the sample data you provided.
To split the categorical interval index into a MultiIndex
, use something like
def interval_to_tuple(interval):
return interval.left, interval.right
counts.set_index(
counts.index.astype(object).map(interval_to_tuple).rename(['Lower', 'Upper']))
Year 2012 2013 2014
Lower Upper
5 10 0 1 0
10 25 2 3 1
25 50 3 2 3
50 100 7 7 5
You should be able to export this result into Excel without issues.