Search code examples
pythonpandasdataframeseries

Longest continuous streaks of multiple users


I want to find the solution for this,

Provided a table with user_id and the dates they visited the platform, find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.

I found these links that explain how to do this for one user. However, I am not sure how to do it for multiple users.

One naive might be to get all unique users and using a for loop and above answer, get the users with a maximum continuous visiting streak. However, I am interested in a vectorised way if possible.

If needed, these are the codes I used,

date_series = pd.Series(np.random.randint(0,10, 400), index=pd.to_datetime(np.random.randint(0,20, 400)*1e9*24*3600), name="uid")
df = date_series.reset_index().rename({"index":"date_val"}, axis=1).drop_duplicates().reset_index(drop=True)

For a given user id (say uid =1), I can use the following to find the max streak,

sub_df = df[df.uid==1].sort_values("date_val")
(sub_df.date_val+pd.Timedelta(days=1) != sub_df.date_val.shift(-1)).cumsum().value_counts().max()

But I don't understand how to do a similar thing for all users in the original dataframe (df) with a vectorized (not for loop) approach.


Solution

  • With the help of @wwnde's answer, I found the below answer. Posting it in case someone finds it useful.

    df.sort_values(["uid", "date_val"], inplace=True)  # sort the df
    
    df["diff1"] = df.date_val.diff().dt.days  # new column to store the date diff
    
    df["user_segments"] = ((df.diff1 != 1)|(df.uid != df.uid.shift(-1))).cumsum()  # to create groups of consecutive days
    
    df.groupby(["uid", "user_segments"])["date_val"].count().reset_index()\ # now date_val column have consecutive day counts
    .groupby("uid")["date_val"].max()\  # then group by and get the max for each user
    .sort_values(ascending=False).iloc[:100]  # finally sort it and get the first 100 users