Search code examples
pythonpandasdataframegroup-by

How can I calculate session IDs from event data, assigning a new ID every time there was no event for 15 minutes?


I have a huge file with events tracked a website. The data contains, among others, the user_id and the time_stamp of events (clicked on a link, viewed an image, etc.). Here is a simplified example:

#%%
import pandas as pd

# make a dataframe
df = pd.DataFrame([['123', 1],
                   ['123', 1],
                   ['123', 19],
                   ['234', 7],
                   ['234', 28],
                   ['234', 29]],
                   columns=['user_id', 'time_stamp'])
print(df)

What I would like to obtain is a session_id column, which is counting the sessions for each user. (Alternatively a string with the user_id and the time_stamp concatenated, but I assume counting is simpler?) I want it to look somewhat like this:

# make a dataframe
df = pd.DataFrame([['123', 1,  0],
                   ['123', 1,  0],
                   ['123', 19, 1],
                   ['234', 7,  0],
                   ['234', 28, 1],
                   ['234', 29, 1]],
                   columns=['user_id', 'time_stamp', session_id])
print(df)

I read quite a lot, and tried even more, but I just can't figure out how to do it without a for loop. There is probably some .shift(1) involved and something with .groupby()? Any help is appreciated.


Solution

  • Try:

    df['session_id'] = df.groupby('user_id')['time_stamp'].diff() >= 15
    df['session_id'] = df.groupby('user_id')['session_id'].cumsum()
    
    print(df)
    

    Prints:

      user_id  time_stamp  session_id
    0     123           1           0
    1     123           1           0
    2     123          19           1
    3     234           7           0
    4     234          28           1
    5     234          29           1