I have a huge file with events tracked a website. The data contains, among others, the user_id
and the time_stamp
of events (clicked on a link, viewed an image, etc.). Here is a simplified example:
#%%
import pandas as pd
# make a dataframe
df = pd.DataFrame([['123', 1],
['123', 1],
['123', 19],
['234', 7],
['234', 28],
['234', 29]],
columns=['user_id', 'time_stamp'])
print(df)
What I would like to obtain is a session_id
column, which is counting the sessions for each user. (Alternatively a string with the user_id
and the time_stamp
concatenated, but I assume counting is simpler?) I want it to look somewhat like this:
# make a dataframe
df = pd.DataFrame([['123', 1, 0],
['123', 1, 0],
['123', 19, 1],
['234', 7, 0],
['234', 28, 1],
['234', 29, 1]],
columns=['user_id', 'time_stamp', session_id])
print(df)
I read quite a lot, and tried even more, but I just can't figure out how to do it without a for
loop. There is probably some .shift(1)
involved and something with .groupby()
? Any help is appreciated.
Try:
df['session_id'] = df.groupby('user_id')['time_stamp'].diff() >= 15
df['session_id'] = df.groupby('user_id')['session_id'].cumsum()
print(df)
Prints:
user_id time_stamp session_id
0 123 1 0
1 123 1 0
2 123 19 1
3 234 7 0
4 234 28 1
5 234 29 1