Search code examples
pythonpandasapplypandas-loc

Python: Return First value based on dates for each product


I am looking for an iterative way of creating a subset of my pandas dataframe based on the date and product. I would like to keep the first row for each product within a 2 week window.

So for df A:

Date,Product,Return
1/1/2020,ABC,0.00993
1/2/2020,ABC,0.04231
1/4/2020,ABC,0.04231
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
6/16/2020,XYZ,0.05284
6/19/2020,XYZ,0.05284
6/25/2020,XYZ,0.01578
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/17/2020,EFG,0.02859
7/21/2020,EFG,0.02084
7/27/2020,EFG,0.05154
7/29/2020,EFG,0.05154
9/8/2020,EFG,0.0616
9/14/2020,EFG,0.04092
9/18/2020,EFG,0.01578
9/22/2020,EFG,0.03248
6/9/2020,ASD,0.03248

I want DF B returned:

Date,Product,Return
1/1/2020,ABC,0.00993
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/27/2020,EFG,0.05154
9/8/2020,EFG,0.0616
6/9/2020,ASD,0.03248

My total dataframe has 10k products, I tried using .loc to create a variable based on the datetime/time delta but it could be based on dates for prior products


Solution

  • You need some way to group them by the week difference. I suggest converting the date to week of year (52 week format), group by product, and get the diff() between each week of that product. Using this we can calculate which diffs are greater than one and use cumsum() to increment the group so they aren't together. The final columns 'c' is you extra grouping column. Group on product and c and use .head(1) to get the first value per group.

    df = pd.DataFrame({'Date': ['1/1/2020','1/2/2020','1/4/2020','1/30/2020',
      '2/20/2020','6/15/2020','6/16/2020','6/19/2020','6/25/2020',
      '8/25/2020','9/25/2020','10/12/2020','12/2/2020','6/11/2020',
      '7/13/2020','7/17/2020','7/21/2020','7/27/2020', '7/29/2020',
      '9/8/2020','9/14/2020','9/18/2020','9/22/2020','6/9/2020'],
     'Product': ['ABC','ABC','ABC','ABC','ABC','XYZ','XYZ',
      'XYZ','XYZ','XYZ','XYZ','XYZ','XYZ','EFG','EFG','EFG',
      'EFG','EFG','EFG','EFG','EFG','EFG','EFG','ASD'],
     'Return': [0.00993,0.04231,0.04231,0.04231,0.01408,0.04868,
      0.05284,0.05284,0.015780000000000002,0.03248,0.03248,
      0.0375, 0.025889999999999996,0.025889999999999996,
      0.025889999999999996,0.028589999999999997,
      0.02084,0.051539999999999996,0.051539999999999996,
      0.0616,0.04092,0.015780000000000002,0.03248,0.03248]})
    
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by='Date').reset_index(drop=True)
    df['week'] = df['Date'].dt.isocalendar().week
    df['c'] = df.groupby('Product')['week'].diff().fillna(0).gt(1).astype(int).cumsum()
    
    df = df.groupby(['Product','c']).head(1)
    
    df.drop(columns=['week','c'], inplace=True)
    

    Output

    Date    Product Return
    0   2020-01-01  ABC 0.00993
    3   2020-01-30  ABC 0.04231
    4   2020-02-20  ABC 0.01408
    5   2020-06-09  ASD 0.03248
    6   2020-06-11  EFG 0.02589
    7   2020-06-15  XYZ 0.04868
    11  2020-07-13  EFG 0.02589
    16  2020-08-25  XYZ 0.03248
    17  2020-09-08  EFG 0.06160
    21  2020-09-25  XYZ 0.03248
    22  2020-10-12  XYZ 0.03750
    23  2020-12-02  XYZ 0.02589