```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['A', 201901, 10, 201801, 201801],
                   ['B', 201902, 11, 201801, 201802],
                   ['B', 201903, 13, 201801, 201803],
                   ['B', 201905, 18, 201801, 201805],
                   ['A', 201906, 80, 201801, 201806],
                   ['A', 202001, 10, 201901, 201901],
                   ['A', 202002, 11, 201901, 201902],
                   ['A', 202003, 13, 201901, 201903],
                   ['A', 202004, 18, 201901, 201904],
                   ['B', 202005, 80, 201901, 201905],
                   ['A', 202006, 80, 201901, 201906],
                   ['B', 201901, 10, 201801, 201801],
                   ['A', 201902, 11, 201801, 201802],
                   ['A', 201903, 13, 201801, 201803],
                   ['A', 201905, 18, 201801, 201805],
                   ['B', 201906, 80, 201801, 201806],
                   ['B', 202001, 10, 201901, 201901],
                   ['B', 202002, 11, 201901, 201902],
                   ['B', 202003, 13, 201901, 201903],
                   ['B', 202004, 18, 201901, 201904],
                   ['A', 202005, 80, 201901, 201905],
                   ['B', 202006, 80, 201901, 201906]],
                  columns=['Store', 'yearweek', 'sales', 'Start_PY', 'PY'])
df
```
From the df above (note that week 201904 is missing), I would like to add a column `'Sales_PY'` containing, on each line, the sum of sales in the previous year's weeks for that store. Something like this:
Store | yearweek | sales | Start_PY | PY | Sales_PY |
---|---|---|---|---|---|
A | 201901 | 100 | 201801 | 201801 | NaN |
B | 201902 | 11 | 201801 | 201802 | NaN |
B | 201903 | 13 | 201801 | 201803 | NaN |
B | 201905 | 18 | 201801 | 201805 | NaN |
A | 201906 | 800 | 201801 | 201806 | NaN |
A | 202001 | 100 | 201901 | 201901 | 100.0 |
A | 202002 | 110 | 201901 | 201902 | 210.0 |
A | 202003 | 130 | 201901 | 201903 | 340.0 |
A | 202004 | 180 | 201901 | 201904 | 340.0 |
B | 202005 | 80 | 201901 | 201905 | 52.0 |
A | 202006 | 800 | 201901 | 201906 | 1320.0 |
B | 201901 | 10 | 201801 | 201801 | NaN |
A | 201902 | 110 | 201801 | 201802 | NaN |
A | 201903 | 130 | 201801 | 201803 | NaN |
A | 201905 | 180 | 201801 | 201805 | NaN |
B | 201906 | 80 | 201801 | 201806 | NaN |
B | 202001 | 10 | 201901 | 201901 | 10.0 |
B | 202002 | 11 | 201901 | 201902 | 21.0 |
B | 202003 | 13 | 201901 | 201903 | 34.0 |
B | 202004 | 18 | 201901 | 201904 | 34.0 |
A | 202005 | 800 | 201901 | 201905 | 520.0 |
B | 202006 | 80 | 201901 | 201906 | 132.0 |
I think there must be an equivalent of Excel's SUMIF in pandas. I.e., Sales_PY for the last row would be the SUM of sales WHERE Store == 'B' AND yearweek >= 201901 AND yearweek <= 201906, which equals 132.

Because I can't guarantee that my df will be sorted by store/week, and I sometimes have weeks missing, I'd prefer not to use `shift()` and/or `cumsum()`.
Note that the df you coded up is inconsistent with the df you printed in the table; I went with the one in the table.

The below is not the most elegant, but I cannot think of a more vectorized operation given the missing weeks etc. We basically implement a row-by-row calculation that follows the SUMIF logic quite closely: the function passed to `apply` (with `axis=1`) is applied to each row `r`; for each row, it selects the relevant subset of the original dataframe `df` and calculates the sum.
```python
df['Sales_PY'] = df.apply(lambda r: df.loc[(df['yearweek'] >= r['Start_PY'])
                                           & (df['yearweek'] <= r['PY'])
                                           & (df['Store'] == r['Store']), 'sales'].sum(),
                          axis=1)
```
Output:

```
    Store  yearweek  sales  Start_PY      PY  Sales_PY
0       A    201901    100    201801  201801         0
1       B    201902     11    201801  201802         0
2       B    201903     13    201801  201803         0
3       B    201905     18    201801  201805         0
4       A    201906    800    201801  201806         0
5       A    202001    100    201901  201901       100
6       A    202002    110    201901  201902       210
7       A    202003    130    201901  201903       340
8       A    202004    180    201901  201904       340
9       B    202005     80    201901  201905        52
10      A    202006    800    201901  201906      1320
11      B    201901     10    201801  201801         0
12      A    201902    110    201801  201802         0
13      A    201903    130    201801  201803         0
14      A    201905    180    201801  201805         0
15      B    201906     80    201801  201806         0
16      B    202001     10    201901  201901        10
17      B    202002     11    201901  201902        21
18      B    202003     13    201901  201903        34
19      B    202004     18    201901  201904        34
20      A    202005    800    201901  201905       520
21      B    202006     80    201901  201906       132
```
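As a sanity check, the SUMIF criterion from the question ("SUM of sales WHERE store == 'B' AND yearweek >= 201901 AND yearweek <= 201906") is just a boolean-mask selection. A minimal sketch, using only the 2019 rows of the sample data (the store-B rows are the ones that matter here):

```python
import pandas as pd

# 2019 subset of the sample data
df = pd.DataFrame([['B', 201901, 10], ['B', 201902, 11], ['B', 201903, 13],
                   ['B', 201905, 18], ['B', 201906, 80], ['A', 201901, 10]],
                  columns=['Store', 'yearweek', 'sales'])

# SUMIF: Store == 'B' AND 201901 <= yearweek <= 201906
mask = (df['Store'] == 'B') & (df['yearweek'] >= 201901) & (df['yearweek'] <= 201906)
print(df.loc[mask, 'sales'].sum())  # 132
```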
If you want `NaN`s instead of 0s where you do not have sales data, you can pass the `min_count=1` parameter to the `sum` above: `.sum(min_count=1)`.
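A minimal illustration of the `min_count` behaviour (the values here are arbitrary, not from the sample df): summing an empty selection, like a row whose PY window matches no sales, gives 0 by default but NaN with `min_count=1`:

```python
import pandas as pd

s = pd.Series([1.0, 2.0])
empty = s[s > 10]  # empty selection, like a row with no matching prior-year weeks

print(empty.sum())             # 0.0 -- an empty sum defaults to zero
print(empty.sum(min_count=1))  # nan -- min_count=1 requires at least one value
```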