Search code examples
pythonpandassumifs

Excel SUMIF equivalent in Pandas


import pandas as pd 
import numpy as np

df = pd.DataFrame([['A', 201901, 10, 201801, 201801], 
                   ['B', 201902, 11, 201801, 201802], 
                   ['B', 201903, 13, 201801, 201803],
                   ['B', 201905, 18, 201801, 201805],
                   ['A', 201906, 80, 201801, 201806],
                   ['A', 202001, 10, 201901, 201901], 
                   ['A', 202002, 11, 201901, 201902], 
                   ['A', 202003, 13, 201901, 201903],
                   ['A', 202004, 18, 201901, 201904],
                   ['B', 202005, 80, 201901, 201905],
                   ['A', 202006, 80, 201901, 201906],
                   ['B', 201901, 10, 201801, 201801], 
                   ['A', 201902, 11, 201801, 201802], 
                   ['A', 201903, 13, 201801, 201803],
                   ['A', 201905, 18, 201801, 201805],
                   ['B', 201906, 80, 201801, 201806],
                   ['B', 202001, 10, 201901, 201901], 
                   ['B', 202002, 11, 201901, 201902], 
                   ['B', 202003, 13, 201901, 201903],
                   ['B', 202004, 18, 201901, 201904],
                   ['A', 202005 ,80, 201901, 201905],
                   ['B', 202006 ,80, 201901, 201906]],
                  columns = ['Store','yearweek','sales','Start_PY','PY'])
df

From the df (note that week 201904 is missing) above I would like to add a column 'Sales_PY' with on each line the sum of sales in the previous yearweeks per store. Something like this:

Store yearweek sales Start_PY PY sales_PY
A 201901 100 201801 201801 NaN
B 201902 11 201801 201802 NaN
B 201903 13 201801 201803 NaN
B 201905 18 201801 201805 NaN
A 201906 800 201801 201806 NaN
A 202001 100 201901 201901 100.0
A 202002 110 201901 201902 210.0
A 202003 130 201901 201903 340.0
A 202004 180 201901 201904 340.0
B 202005 80 201901 201905 52.0
A 202006 800 201901 201906 1320.0
B 201901 10 201801 201801 NaN
A 201902 110 201801 201802 NaN
A 201903 130 201801 201803 NaN
A 201905 180 201801 201805 NaN
B 201906 80 201801 201806 NaN
B 202001 10 201901 201901 10.0
B 202002 11 201901 201902 21.0
B 202003 13 201901 201903 34.0
B 202004 18 201901 201904 34.0
A 202005 800 201901 201905 520.0
B 202006 80 201901 201906 132.0

And I think there must be a SUMIF equivalent from Excel within Pandas.

I.e. Sales PY for the last row would be SUM of sales WHERE store == 'B' AND yearweek >= 201901 AND yearweek <= 201906. This equals 132.

Because I can't ensure that my df will be arranged by store/week and I sometimes have weeks missing in my df I wouldn't prefer using the shift() and/or cumsum() function.


Solution

  • Completely replaced the answer following clarifications from OP

    Note that the df you coded up is inconsistent with the df you printed in the table. I went with the one in the table

    The below is not the most elegant but I cannot think of a more vectorized operation given missing weeks etc

    We basically implement row-by-row calculation that follows sumif logic quite closely. The function in apply is applied to each row r For each row r it selects the relevant subset of the original dataframe df and calculates the sum

    df['Sales_PY'] = (df.apply(lambda r: df.loc[(df['yearweek'] >= r['Start_PY'])
                                               &(df['yearweek'] <= r['PY']) 
                                               &(df['Store']==r['Store']) ,'sales'].sum(),axis=1)
                    )
    

    output

        Store      yearweek    sales    Start_PY      PY    Sales_PY
    --  -------  ----------  -------  ----------  ------  ----------
     0  A            201901      100      201801  201801           0
     1  B            201902       11      201801  201802           0
     2  B            201903       13      201801  201803           0
     3  B            201905       18      201801  201805           0
     4  A            201906      800      201801  201806           0
     5  A            202001      100      201901  201901         100
     6  A            202002      110      201901  201902         210
     7  A            202003      130      201901  201903         340
     8  A            202004      180      201901  201904         340
     9  B            202005       80      201901  201905          52
    10  A            202006      800      201901  201906        1320
    11  B            201901       10      201801  201801           0
    12  A            201902      110      201801  201802           0
    13  A            201903      130      201801  201803           0
    14  A            201905      180      201801  201805           0
    15  B            201906       80      201801  201806           0
    16  B            202001       10      201901  201901          10
    17  B            202002       11      201901  201902          21
    18  B            202003       13      201901  201903          34
    19  B            202004       18      201901  201904          34
    20  A            202005      800      201901  201905         520
    21  B            202006       80      201901  201906         132
    

    If you want NaNs instead of 0s where you do not have sales data you can pass min_count=1 parameter in the sum above: .sum(min_count=1)