data = [
["Item_1", "2020-06-01"],
["Item_1", "2020-06-02"],
["Item_1", "2020-05-27"],
["Item_2", "2018-04-15"],
["Item_2", "2018-04-18"],
["Item_2", "2018-04-22"],
["Item_2", "2018-04-28"],
]
df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df
I have a dataset containing a column of Item IDs
and Dates
. I would like to assign a "ranking" of sorts in a new column, where the rank/order value is increased IF the next date is >3 days from the previous date, otherwise it stays the same.
So the desired output would look like this:
Item_ID Dates Date Order
Item_1 2020-05-27 1
Item_1 2020-06-01 2
Item_1 2020-06-02 2
Item_2 2018-04-15 1
Item_2 2018-04-18 1
Item_2 2018-04-22 2
Item_2 2018-04-28 3
We can use groupby apply
to compute the difference per group between days, then use cumsum
on the to "count" how many differences are greater than (`gt) 3 days:
# Convert to datetime (if not already)
df['Dates'] = pd.to_datetime(df['Dates'])
# Sort in correct order
df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
# Calculate Ranking per Group
df['Date Order'] = (
df.groupby('Item_ID')['Dates'].apply(
lambda s: s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
)
)
Could also groupby
twice and use groupby diff
and groupby cumsum
:
# Convert to datetime (if not already)
df['Dates'] = pd.to_datetime(df['Dates'])
# Sort in correct order
df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
# Reuse same Grouper
g = df.groupby('Item_ID')
# Calculate Difference per group and compare (whole Series)
df['Date Order'] = g['Dates'].diff().gt(pd.Timedelta(days=3))
# Calculate cumsum per group
df['Date Order'] = g['Date Order'].cumsum() + 1
Both produce df
:
Item_ID Dates Date Order
0 Item_1 2020-05-27 1
1 Item_1 2020-06-01 2
2 Item_1 2020-06-02 2
3 Item_2 2018-04-15 1
4 Item_2 2018-04-18 1
5 Item_2 2018-04-22 2
6 Item_2 2018-04-28 3
Here is a breakdown of the steps per group as a DataFrame:
s = pd.Series([pd.Timestamp('2020-05-27 00:00:00'),
pd.Timestamp('2020-06-01 00:00:00'),
pd.Timestamp('2020-06-02 00:00:00')],
name='Dates',
index=pd.Series({0: 'Item_1', 1: 'Item_1', 2: 'Item_1'},
name='Item_ID'))
steps_per_group = pd.DataFrame({
'diff': s.diff(),
'gt': s.diff().gt(pd.Timedelta(days=3)),
'cumsum': s.diff().gt(pd.Timedelta(days=3)).cumsum(),
'cumsum 1 start': s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
})
diff gt cumsum cumsum 1 start
Item_ID
Item_1 NaT False 0 1
Item_1 5 days True 1 2
Item_1 1 days False 1 2