I want to find the solution for this,
Provided a table with user_id and the dates they visited the platform, find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.
I found these links that explain how to do this for one user. However, I am not sure how to do it for multiple users.
One naive might be to get all unique users and using a for loop and above answer, get the users with a maximum continuous visiting streak. However, I am interested in a vectorised way if possible.
If needed, these are the codes I used,
date_series = pd.Series(np.random.randint(0,10, 400), index=pd.to_datetime(np.random.randint(0,20, 400)*1e9*24*3600), name="uid")
df = date_series.reset_index().rename({"index":"date_val"}, axis=1).drop_duplicates().reset_index(drop=True)
For a given user id (say uid =1), I can use the following to find the max streak,
sub_df = df[df.uid==1].sort_values("date_val")
(sub_df.date_val+pd.Timedelta(days=1) != sub_df.date_val.shift(-1)).cumsum().value_counts().max()
But I don't understand how to do a similar thing for all users in the original dataframe (df) with a vectorized (not for loop) approach.
With the help of @wwnde's answer, I found the below answer. Posting it in case someone finds it useful.
df.sort_values(["uid", "date_val"], inplace=True) # sort the df
df["diff1"] = df.date_val.diff().dt.days # new column to store the date diff
df["user_segments"] = ((df.diff1 != 1)|(df.uid != df.uid.shift(-1))).cumsum() # to create groups of consecutive days
df.groupby(["uid", "user_segments"])["date_val"].count().reset_index()\ # now date_val column have consecutive day counts
.groupby("uid")["date_val"].max()\ # then group by and get the max for each user
.sort_values(ascending=False).iloc[:100] # finally sort it and get the first 100 users