I am looking for an iterative way of creating a subset of my pandas DataFrame based on date and product. I would like to keep only the first row for each product within each two-week window.
So for df A:
Date,Product,Return
1/1/2020,ABC,0.00993
1/2/2020,ABC,0.04231
1/4/2020,ABC,0.04231
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
6/16/2020,XYZ,0.05284
6/19/2020,XYZ,0.05284
6/25/2020,XYZ,0.01578
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/17/2020,EFG,0.02859
7/21/2020,EFG,0.02084
7/27/2020,EFG,0.05154
7/29/2020,EFG,0.05154
9/8/2020,EFG,0.0616
9/14/2020,EFG,0.04092
9/18/2020,EFG,0.01578
9/22/2020,EFG,0.03248
6/9/2020,ASD,0.03248
I want DF B returned:
Date,Product,Return
1/1/2020,ABC,0.00993
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/27/2020,EFG,0.05154
9/8/2020,EFG,0.0616
6/9/2020,ASD,0.03248
My full DataFrame has 10k products. I tried using .loc to build a cutoff variable from a datetime/timedelta, but that cutoff could end up based on dates from a prior product's rows.
You need some way to group the rows by their week difference. I suggest converting the date to ISO week of year (52-week format), grouping by product, and taking the diff() between the week numbers of consecutive rows for each product. From that we can flag which diffs are greater than one and use cumsum() to increment a counter, so rows separated by a gap land in different groups. The final column 'c' is your extra grouping column. Group on Product and c and use .head(1) to get the first row per group. (If your data spans more than one calendar year, note that week-of-year wraps back to 1 at the year boundary, so you would want to group on year as well as week.)
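To see what the diff()/cumsum() trick does, here it is applied in isolation to the ISO week numbers of the EFG rows (24, 29, 29, 30, 31, 31, 37 are the week numbers of 6/11, 7/13, 7/17, 7/21, 7/27, 7/29, 9/8):

```python
import pandas as pd

# ISO week numbers of the EFG dates 6/11, 7/13, 7/17, 7/21, 7/27, 7/29, 9/8
weeks = pd.Series([24, 29, 29, 30, 31, 31, 37])

new_block = weeks.diff().fillna(0).gt(1)  # True where the gap to the previous row exceeds one week
c = new_block.astype(int).cumsum()        # running counter: each True starts a new group label

print(c.tolist())  # [0, 1, 1, 1, 1, 1, 2] -> three groups; head(1) keeps 6/11, 7/13, 9/8
```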
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/1/2020', '1/2/2020', '1/4/2020', '1/30/2020', '2/20/2020',
             '6/15/2020', '6/16/2020', '6/19/2020', '6/25/2020', '8/25/2020',
             '9/25/2020', '10/12/2020', '12/2/2020', '6/11/2020', '7/13/2020',
             '7/17/2020', '7/21/2020', '7/27/2020', '7/29/2020', '9/8/2020',
             '9/14/2020', '9/18/2020', '9/22/2020', '6/9/2020'],
    'Product': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ',
                'XYZ', 'XYZ', 'XYZ', 'XYZ', 'EFG', 'EFG', 'EFG', 'EFG', 'EFG',
                'EFG', 'EFG', 'EFG', 'EFG', 'EFG', 'ASD'],
    'Return': [0.00993, 0.04231, 0.04231, 0.04231, 0.01408, 0.04868, 0.05284,
               0.05284, 0.01578, 0.03248, 0.03248, 0.0375, 0.02589, 0.02589,
               0.02589, 0.02859, 0.02084, 0.05154, 0.05154, 0.0616, 0.04092,
               0.01578, 0.03248, 0.03248]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by='Date').reset_index(drop=True)
df['week'] = df['Date'].dt.isocalendar().week
# flag rows whose week gap to the previous row of the same product exceeds 1,
# then cumsum the flags into a running group label
df['c'] = df.groupby('Product')['week'].diff().fillna(0).gt(1).astype(int).cumsum()
df = df.groupby(['Product', 'c']).head(1)  # first row of each group
df.drop(columns=['week', 'c'], inplace=True)
Output
Date Product Return
0 2020-01-01 ABC 0.00993
3 2020-01-30 ABC 0.04231
4 2020-02-20 ABC 0.01408
5 2020-06-09 ASD 0.03248
6 2020-06-11 EFG 0.02589
7 2020-06-15 XYZ 0.04868
11 2020-07-13 EFG 0.02589
16 2020-08-25 XYZ 0.03248
17 2020-09-08 EFG 0.06160
21 2020-09-25 XYZ 0.03248
22 2020-10-12 XYZ 0.03750
23 2020-12-02 XYZ 0.02589
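One caveat: the consecutive-week diff() chains runs of rows together, so a run like EFG's 7/13–7/29 (which spans 16 days) collapses to a single row. That is why 7/27 from your DF B does not appear above. If you want a strict rolling 14-day window per product instead, a small explicit loop does it. A sketch under my assumptions: `first_in_window` is a name I made up, and whether a gap of exactly 14 days starts a new window is ambiguous in your DF B (it keeps 7/27, exactly 14 days after 7/13, but drops 9/22, exactly 14 days after 9/8), so adjust the `>=` to taste:

```python
import pandas as pd

def first_in_window(group, window=pd.Timedelta(days=14)):
    """Keep a row only if it falls at least `window` after the last kept row."""
    keep, last_kept = [], None
    for idx, date in group['Date'].items():
        if last_kept is None or date - last_kept >= window:
            keep.append(idx)
            last_kept = date
    return group.loc[keep]

# a small subset of the data to illustrate
df = pd.DataFrame({
    'Date': pd.to_datetime(['1/1/2020', '1/2/2020', '1/4/2020', '1/30/2020',
                            '7/13/2020', '7/17/2020', '7/27/2020']),
    'Product': ['ABC'] * 4 + ['EFG'] * 3,
    'Return': [0.00993, 0.04231, 0.04231, 0.04231, 0.02589, 0.02859, 0.05154],
})

out = pd.concat(first_in_window(g)
                for _, g in df.sort_values('Date').groupby('Product'))
# keeps 1/1 and 1/30 for ABC, 7/13 and 7/27 for EFG
```

The loop is O(n) per product after the sort, so it should still be workable across 10k products.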