I have a dataset of user sessions, loaded into a Pandas DataFrame:
SessionID,UserID,Logon_time,Logoff_time
Adx1YiRyvOFApQiniyPWYPo,AbO6vW58ta1Bgrqs.RA0uHg,2016-01-05 07:46:56.180,2016-01-05 08:04:36.057
AfjMzw8In8RDqK6jIfItZPs,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 14:48:47.183,2016-01-04 14:53:30.210
AYIdSJYsRw5PptkFfEOXPa0,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 11:06:37.040,2016-01-04 16:34:38.770
Ac.WXBBSl75KqEuBmNljYPE,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:58:04.227,2016-01-04 11:21:10.520
AekXRDR3mBBDh49IIN2HdU8,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:16:08.040,2016-01-04 10:34:20.523
AVvL3VSWSq5Fr.f4733X.T4,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 09:19:29.773,2016-01-04 09:40:25.157
What I would like to do is to transform this data into a DataFrame with two columns: a timestamp, and the number of sessions active at that timestamp.
I can do this for a single timestamp by turning the datetime ranges into an Interval, then counting the rows whose interval contains the given timestamp.
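A minimal sketch of that single-timestamp check, assuming the timestamp columns still need parsing from strings (the example timestamp is arbitrary); pd.IntervalIndex makes the containment test vectorised across all rows:

import pandas as pd

# Make sure the timestamp columns are real datetimes, not strings.
df['Logon_time'] = pd.to_datetime(df['Logon_time'])
df['Logoff_time'] = pd.to_datetime(df['Logoff_time'])

# One Interval per session; closed='both' counts a session as active
# at both its logon and logoff instants.
intervals = pd.IntervalIndex.from_arrays(df['Logon_time'], df['Logoff_time'],
                                         closed='both')

# Number of sessions active at one given timestamp.
ts = pd.Timestamp('2016-01-04 11:00:00')
active = intervals.contains(ts).sum()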
However, if I want to do this over one or two years at minute or hour resolution, that means looping over every grid point: 8760 iterations for a single year even at hourly resolution… This might not be a dealbreaker, but I was wondering if anyone had any other (possibly more elegant) suggestions or ideas.
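For reference, a sketch of the brute-force loop I'm describing, reusing the intervals from the snippet above (the hourly grid and variable names are just an illustration):

# One containment check per grid point: 8760 iterations for a
# non-leap year at hourly resolution.
grid = pd.date_range('2016-01-01', periods=8760, freq='H')
counts = pd.Series([intervals.contains(t).sum() for t in grid], index=grid)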
IIUC, we can do it this way:
# Emit a 1 for every minute of every session, then count how many
# sessions are active in each minute.
df.apply(lambda x: pd.Series(1, index=pd.date_range(x.Logon_time, x.Logoff_time, freq='T')),
         axis=1)\
  .stack().reset_index(level=0, drop=True).resample('T').count()
Output (head):
2016-01-04 09:19:00 1
2016-01-04 09:20:00 1
2016-01-04 09:21:00 1
2016-01-04 09:22:00 1
2016-01-04 09:23:00 1
Freq: T, dtype: int64
Use Pandas' built-in plotting to examine the full series:
# Same per-minute counts as above, plotted to eyeball the whole period.
df.apply(lambda x: pd.Series(1, index=pd.date_range(x.Logon_time, x.Logoff_time, freq='T')),
         axis=1)\
  .stack().reset_index(level=0, drop=True).resample('T').count().plot()