Search code examples
pythonpandasranking

How to create a ranking column based on date value range in another column?


data = [
    ["Item_1", "2020-06-01"],
    ["Item_1", "2020-06-02"],
    ["Item_1", "2020-05-27"],
    ["Item_2", "2018-04-15"],
    ["Item_2", "2018-04-18"],
    ["Item_2", "2018-04-22"],
    ["Item_2", "2018-04-28"],
]

df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df

I have a dataset containing a column of Item IDs and Dates. I would like to assign a "ranking" of sorts in a new column, where the rank/order value is increased IF the next date is >3 days from the previous date, otherwise it stays the same.

So the desired output would look like this:

    Item_ID    Dates    Date Order
    Item_1  2020-05-27      1
    Item_1  2020-06-01      2
    Item_1  2020-06-02      2 
    Item_2  2018-04-15      1
    Item_2  2018-04-18      1
    Item_2  2018-04-22      2 
    Item_2  2018-04-28      3

Solution

  • We can use groupby apply to compute the difference per group between days, then use cumsum on the to "count" how many differences are greater than (`gt) 3 days:

    # Convert to datetime (if not already)
    df['Dates'] = pd.to_datetime(df['Dates'])
    # Sort in correct order
    df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
    # Calculate Ranking per Group
    df['Date Order'] = (
        df.groupby('Item_ID')['Dates'].apply(
            lambda s: s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
        )
    )
    

    Could also groupby twice and use groupby diff and groupby cumsum:

    # Convert to datetime (if not already)
    df['Dates'] = pd.to_datetime(df['Dates'])
    # Sort in correct order
    df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
    
    # Reuse same Grouper
    g = df.groupby('Item_ID') 
    # Calculate Difference per group and compare (whole Series)
    df['Date Order'] = g['Dates'].diff().gt(pd.Timedelta(days=3))
    # Calculate cumsum per group
    df['Date Order'] = g['Date Order'].cumsum() + 1
    

    Both produce df:

      Item_ID      Dates  Date Order
    0  Item_1 2020-05-27           1
    1  Item_1 2020-06-01           2
    2  Item_1 2020-06-02           2
    3  Item_2 2018-04-15           1
    4  Item_2 2018-04-18           1
    5  Item_2 2018-04-22           2
    6  Item_2 2018-04-28           3
    

    Here is a breakdown of the steps per group as a DataFrame:

    s = pd.Series([pd.Timestamp('2020-05-27 00:00:00'),
                   pd.Timestamp('2020-06-01 00:00:00'),
                   pd.Timestamp('2020-06-02 00:00:00')],
                  name='Dates',
                  index=pd.Series({0: 'Item_1', 1: 'Item_1', 2: 'Item_1'},
                                  name='Item_ID'))
    steps_per_group = pd.DataFrame({
        'diff': s.diff(),
        'gt': s.diff().gt(pd.Timedelta(days=3)),
        'cumsum': s.diff().gt(pd.Timedelta(days=3)).cumsum(),
        'cumsum 1 start': s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
    })
    
              diff     gt  cumsum  cumsum 1 start
    Item_ID                                      
    Item_1     NaT  False       0               1
    Item_1  5 days   True       1               2
    Item_1  1 days  False       1               2