Search code examples
pythonpandasdata-manipulationsumifs

SUMIF equivalent with unique date ranges in Python (Summing if date falls within various date ranges for variable creation)


I am looking to create variables that sum based on date ranges unique to different features / categories to automate a current Excel task in Python. It is like a SUMIF in Excel but unique date ranges for different variables. I`ll try to recreate a similar situation as I cannot share the exact data. At the moment, I have a sales dataframe with sales per week by area like so:

Week Area Sales
08/02/2019 London 200
08/02/2019 Scotland 150
15/02/2019 London 100
15/02/2019 Scotland 120
22/02/2019 London 50
22/02/2019 Scotland 20

I want to incorporate whether the date falls within sales periods for products, so say I have another dataframe like this:

Product Sale Start Week Sale End Week
Boots 08/02/2019 15/02/2019
Accessories 15/02/2019 22/02/2019

I want to create something that sums if the dates fall within those specified for each product. For example, for Boots below, sum Sales if the weeks in Sales fall within the Sales Periods date range:

Area Boots Accessories
London 300 150
Scotland 270 140

I`ve tried groupby and a pivot table but I am not sure how to incorporate the sales dates filters into it. At the moment, the sales period dataframe and the sales dataframe are separate.

This is what I have for the pivot code which is almost there:

test = pd.pivot_table(df,index=['Area','Week'],columns=sales_period_df['Product'],values=['Sales'],aggfunc=np.sum)

But this doesnt include filtering for the sales periods and I`m not sure how to incorporate this. Would appreciate your advice, thanks in advance!


Solution

  • Due to overlapping periods, we can't use the classic pivoting in this case (unless we duplicate overlapping sales records for each period, wich seems too much). So we have to create this table manually.

    To start, let's prepare some data to work with:

    import pandas as pd
    from io import StringIO
    
    data = '''Week,Area,Sales
    08/02/2019,London,200
    08/02/2019,Scotland,150
    15/02/2019,London,100
    15/02/2019,Scotland,120
    22/02/2019,London,50
    22/02/2019,Scotland,20'''
    
    df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()
    
    data = '''Product,Sale Start Week,Sale End Week
    Boots,08/02/2019,15/02/2019
    Accessories,15/02/2019,22/02/2019
    Something,08/02/2019,22/02/2019'''
    
    sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)
    

    The structure of df and sales_period_df is slightly modified so that Week and Product are now indexes.

    Next, we prepare the output frame and supportive data:

    import pandas.IndexSlice as idx
    
    # create slices from sales_period_df
    # which can be used to locate data in df
    periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
    
    # separate sales by area
    sales_by_area = df.groupby('Area')['Sales']
    
    # create the output DataFrame with unique areas as indexes 
    # and products as columns
    output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
    

    To fill in the data, we can use eather apply or agg like this:

    for product in output.columns:
        output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())
    

    Let's assemble the code:

    import pandas as pd
    from pandas import IndexSlice as idx
    from io import StringIO
    
    data = '''Week,Area,Sales
    08/02/2019,London,200
    08/02/2019,Scotland,150
    15/02/2019,London,100
    15/02/2019,Scotland,120
    22/02/2019,London,50
    22/02/2019,Scotland,20'''
    
    df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True, dayfirst=True).sort_index()
    
    data = '''Product,Sale Start Week,Sale End Week
    Boots,08/02/2019,15/02/2019
    Accessories,15/02/2019,22/02/2019
    Something,08/02/2019,22/02/2019'''
    
    sales_period_df = pd.read_csv(StringIO(data), index_col=0, parse_dates=[1, 2], dayfirst=True)
    
    periods = sales_period_df.agg(lambda row: idx[row['Sale Start Week']:row['Sale End Week']], axis=1)
    output = pd.DataFrame(index=df['Area'].unique(), columns=sales_period_df.index)
    sales_by_area = df.groupby('Area')['Sales']
    
    for product in output.columns:
        output[product] = sales_by_area.agg(lambda sales: sales.loc[periods[product]].sum())
    
    print(output)
    

    Output:

    Product   Boots  Accessories  Something
    London      300          150        350
    Scotland    270          140        290