I have a multiindex dataframe with sales data for different regions, sizes, and dates. I want to calculate the "worldwide" (over all regions) sum of sales, by size, for each date, then assign that to a column in the original dataframe, with each worldwide value for sales and size broadcast to every region. I thought I could group by size and date, then cumsum() and then take the answer and reindex it to the original dataframe, but that doesn't seem to work.
Here's the setup code:
import pandas as pd
#Create the dataframe 'df'
regions=['NorthAm']*9
regions.extend(['APAC']*9)
regions.extend(['Eur']*9)
sizes=[12]*3
sizes.extend([14]*3)
sizes.extend([16]*3)
sizes=sizes*3
dates=['1/1/2011','1/2/2011','1/3/2011']*27
idx=zip(regions,sizes,dates)
idx=pd.MultiIndex.from_tuples(idx, names=['Region','Size','Date'])
df=pd.DataFrame(np.arange(27), index = idx, columns=['Sales'])
# Check it
df
Out[1]:
Sales
Region Size Date
NorthAm 12 1/1/2011 0
1/2/2011 1
1/3/2011 2
14 1/1/2011 3
1/2/2011 4
1/3/2011 5
16 1/1/2011 6
1/2/2011 7
1/3/2011 8
APAC 12 1/1/2011 9
1/2/2011 10
1/3/2011 11
14 1/1/2011 12
1/2/2011 13
1/3/2011 14
16 1/1/2011 15
1/2/2011 16
1/3/2011 17
Eur 12 1/1/2011 18
1/2/2011 19
1/3/2011 20
14 1/1/2011 21
1/2/2011 22
1/3/2011 23
16 1/1/2011 24
1/2/2011 25
1/3/2011 26
# Yes, that's right. Now create the cumulative sum, regardless of region
cs=df.groupby(level=['Size','Date']).sum().groupby(level=0).cumsum()
# Check it.
cs
Out[1]:
Sales
Size Date
12 1/1/2011 27
1/2/2011 57
1/3/2011 90
14 1/1/2011 36
1/2/2011 75
1/3/2011 117
16 1/1/2011 45
1/2/2011 93
1/3/2011 144
Now I'd like to do something like:
df['WWSales']=cd.reindex(df, method='???')
to get something like:
Out[2]:
Sales WWSales
Region Size Date
NorthAm 12 1/1/2011 0 27
1/2/2011 1 57
1/3/2011 2 90
14 1/1/2011 3 36
1/2/2011 4 75
1/3/2011 5 115
16 1/1/2011 6 45
1/2/2011 7 93
1/3/2011 8 144
APAC 12 1/1/2011 9 27
1/2/2011 10 57
1/3/2011 11 90
14 1/1/2011 12 36
1/2/2011 13 75
1/3/2011 14 115
16 1/1/2011 15 45
1/2/2011 16 93
1/3/2011 17 144
Eur 12 1/1/2011 18 27
1/2/2011 19 57
1/3/2011 20 90
14 1/1/2011 21 36
1/2/2011 22 75
1/3/2011 23 115
16 1/1/2011 24 45
1/2/2011 25 93
1/3/2011 26 144
I suspect there's some really obvious solution, but I'm not seeing it (and not finding it in searches). Any help would be greatly appreciated.
By the way, bonus points for an elegant way to make the date column into a DateTimeIndex. The following works, but seems inelegant:
df.index.levels[2]=pd.tseries.period.DatetimeIndex(df.index.levels[2])
Instead of (implicitly) applying sum
and cumsum
, use the transform
method on each group. That is:
df['WWSales'] = df.groupby(level=['Size','Date']).transform(np.sum).groupby(level=0).transform(np.cumsum)
That gives your desired output.
See http://pandas.pydata.org/pandas-docs/dev/groupby.html#transformation
I'll take a crack at your bonus question: you could map Timestamp
onto your list.
dates = map(pd.Timestamp, ['1/1/2011','1/2/2011','1/3/2011']*27)
But I expect there is a neater way -- I cede that question to more experienced voices....