Search code examples
pythonexcelpandasdataframesumifs

Emulating an excel sumifs / multi-criteria index match with dataframes in python / pandas


I have two dataframes in python using pandas:

  • df1: [233,500 x 6] ledger of prices by date
  • df2: [1,665,997 x 5] ledger of customer usage by date

On a smaller data set in excel I can simply do the following to create a column on df2 containing each products price on a given date, by row.

=SUMIFS(df1[Rate],df1[Date],[@Date],df1[Jurisdiction],[@Jurisdiction],df1[Product],[@Product])

Ultimately I want to add a column to df2 (or make a new dataframe with the result) that provides price by matching some criteria in each row (Jurisdiction, Product Type) with the same criteria in the price ledger df1.

What's the most appropriate way to associate this data using python and dataframes? Some sort of dictionary and a join?

The examples I've found are mainly dealing with summing given some conditions:

Bonus: The dates in df1 and df2 will not always match identically. Matching the date in df2 with the most recent price ledger date will need to happen.

Edit: I've include simplified data below to demonstrate how I'm trying to do the most-recent date matching. The date matching is just a middle step towards finding the active price on each date in df2. The direct merge on the two date columns doesn't work because prices are not provided every day.

df1:
Date       Price
1/11/2016  5.00
1/12/2016  5.50
1/13/2016  6.00
1/14/2016  7.00
1/16/2016  8.00
1/20/2016  9.00
1/21/2016  10.00
1/22/2016  11.00

df2:
Date       Volume
1/11/2016  100
1/15/2016  100
1/17/2016  200
1/18/2016  300
1/20/2016  200

df3: (df2 with date matching. Cost = Volume*Price)
Date       Volume  MatchedDate  Price  Cost
1/11/2016  100     1/11/2016    5.00   500
1/15/2016  100     1/14/2016    7.00   700
1/17/2016  200     1/16/2016    8.00   1600
1/18/2016  300     1/16/2016    8.00   2400
1/20/2016  200     1/20/2016    9.00   1800

Edit 2: The formula provided in the first section in the following works, in excel, with some extra logic to use the first date for df2 entries preceding the first df1 entry. https://www.extendoffice.com/documents/excel/2601-excel-find-closest-date.html

{=MAX((df1[Date]<[@Date])*df1[Date])}

Solution

  • Simply merge the two sets on matching criteria and then run a groupby sum:

    merged_df = pd.merge(df1, df2, on=['date', 'Jurisdiction', 'Product'])
    
    merged_df.groupby(['date', 'Jurisdiction', 'Product'])['Rate'].sum()
    

    Or with most recent date of df2:

    most_recent_df2 = df2.merge(df2.groupby(['Jurisdiction', 'Product'])['date'].max().reset_index(), 
                                on=['date', 'Jurisdiction', 'Product'])
    
    merged_df = pd.merge(df1, most_recent_df2, on=['Jurisdiction', 'Product'], suffixes=['', '_'])
    
    merged_df.groupby(['date', 'Jurisdiction', 'Product'])['Rate'].sum()