I am looking to create variables that sum based on date ranges unique to different features / categories to automate a current Excel task in Python. It is like a SUMIF in Excel but unique date ranges for different variables. I`ll try to recreate a similar situation as I cannot share the exact data. At the moment, I have a sales dataframe with sales per week by area like so:
Week | Area | Sales |
---|---|---|
08/02/2019 | London | 200 |
08/02/2019 | Scotland | 150 |
15/02/2019 | London | 100 |
15/02/2019 | Scotland | 120 |
22/02/2019 | London | 50 |
22/02/2019 | Scotland | 20 |
I want to incorporate whether the date falls within sales periods for products, so say I have another dataframe like this:
Product | Sale Start Week | Sale End Week |
---|---|---|
Boots | 08/02/2019 | 15/02/2019 |
Accessories | 15/02/2019 | 22/02/2019 |
I want to create something that sums if the dates fall within those specified for each product. For example, for Boots below, sum Sales if the weeks in Sales fall within the Sales Periods date range:
Area | Boots | Accessories |
---|---|---|
London | 300 | 150 |
Scotland | 270 | 140 |
I`ve tried groupby and a pivot table but I am not sure how to incorporate the sales dates filters into it. At the moment, the sales period dataframe and the sales dataframe are separate.
This is what I have for the pivot code which is almost there:
test = pd.pivot_table(df,index=['Area','Week'],columns=sales_period_df['Product'],values=['Sales'],aggfunc=np.sum)
But this doesnt include filtering for the sales periods and I`m not sure how to incorporate this. Would appreciate your advice, thanks in advance!
Due to overlapping periods, we can't use the classic pivoting in this case (unless we duplicate overlapping sales records for each period, wich seems too much). So we have to create this table manually.
To start, let's prepare some data to work with:
import pandas as pd
from io import StringIO
data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()
data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''
sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)
The structure of df
and sales_period_df
is slightly modified so that Week
and Product
are now indexes.
Next, we prepare the output frame and supportive data:
import pandas.IndexSlice as idx
# create slices from sales_period_df
# which can be used to locate data in df
periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
# separate sales by area
sales_by_area = df.groupby('Area')['Sales']
# create the output DataFrame with unique areas as indexes
# and products as columns
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
To fill in the data, we can use eather apply
or agg
like this:
for product in output.columns:
output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())
Let's assemble the code:
import pandas as pd
from pandas import IndexSlice as idx
from io import StringIO
data = '''Week,Area,Sales
08/02/2019,London,200
08/02/2019,Scotland,150
15/02/2019,London,100
15/02/2019,Scotland,120
22/02/2019,London,50
22/02/2019,Scotland,20'''
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()
data = '''Product,Sale Start Week,Sale End Week
Boots,08/02/2019,15/02/2019
Accessories,15/02/2019,22/02/2019
Something,08/02/2019,22/02/2019'''
sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)
periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
sales_by_area = df.groupby('Area')['Sales']
for product in output.columns:
output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())
print(output)
Output:
Product Boots Accessories Something
London 300 150 350
Scotland 270 140 290