Search code examples
pythonpandasintervals

Count the number of user sessions, defined as intervals


I have a dataset of user sessions, loaded into a Pandas DataFrame:

SessionID, UserID, Logon_time, Logoff_time
Adx1YiRyvOFApQiniyPWYPo,AbO6vW58ta1Bgrqs.RA0uHg,2016-01-05 07:46:56.180,2016-01-05 08:04:36.057
AfjMzw8In8RDqK6jIfItZPs,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 14:48:47.183,2016-01-04 14:53:30.210
AYIdSJYsRw5PptkFfEOXPa0,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 11:06:37.040,2016-01-04 16:34:38.770
Ac.WXBBSl75KqEuBmNljYPE,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:58:04.227,2016-01-04 11:21:10.520
AekXRDR3mBBDh49IIN2HdU8,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:16:08.040,2016-01-04 10:34:20.523
AVvL3VSWSq5Fr.f4733X.T4,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 09:19:29.773,2016-01-04 09:40:25.157

What I would like to do is to transform this data into a DataFrame with two columns:

  • A timestamp / period (with e.g. a resolution of minute)
  • The number of sessions that existed at that time

I can do this for a single timestamp, by turning the datetime ranges into an Interval, then checking how many rows there are where the given timestamp falls into the interval.

However, if I want to do this for one or two years, with a resolution of minute or hour, I'll end up with 8760 loops (in the case of hours) for a single year… This might not be a dealbreaker, but I was wondering if anyone had any other (possibly more elegant) suggestions or ideas.


Solution

  • IIUC, we can do it this way:

    df.apply(lambda x: pd.Series([1] * len(pd.date_range(x.Logon_time, x.Logoff_time, freq='T')), 
             index=pd.date_range(x.Logon_time, x.Logoff_time, freq='T')), axis=1)\
      .stack().reset_index(level=0, drop=True).resample('T').count()
    

    Output(head):

    2016-01-04 09:19:00    1
    2016-01-04 09:20:00    1
    2016-01-04 09:21:00    1
    2016-01-04 09:22:00    1
    2016-01-04 09:23:00    1
    Freq: T, dtype: int64
    

    Use Pandas visualization to examine all the data:

    df.apply(lambda x: pd.Series([1] * len(pd.date_range(x.Logon_time, x.Logoff_time, freq='T')),
                                 index=pd.date_range(x.Logon_time, x.Logoff_time, freq='T')), axis=1)\
      .stack().reset_index(level=0, drop=True).resample('T').count().plot()
    

    enter image description here