Search code examples
pythonaudit-logging

How to identify simultaneous users from a login and logout table?


The table below shows the login and logout time of a given software.

USER_NAME USER_ROLE GMT_LOGIN_TIME      GMT_LOGOUT_TIME       LOGIN_DURATION 
ABCD      SUP_235   2017-12-15 11:05:25 2017-12-15 11:09:01           216.0
ABCE      SUP_235   2017-12-15 11:23:14 2017-12-15 11:33:17           603.0
ABCF      USER_285  2017-12-15 11:44:12 2017-12-15 12:07:52          1420.0
ABCG      SUP_230   2017-12-15 18:43:17 2017-12-15 19:00:20          1023.0
ABCH      USER_245  2017-12-15 10:36:59 2017-12-15 11:42:00          3901.0

How can I identify simultaneous users? I'll have to do this in Python but I'm not worrying about the implementation in that programming language. I'd like to understand the logic itself.

In the example above users ABCD, ABCE and ABCH are the only simultaneous.


Solution

  • Here is a way to do it using pandas:

    from StringIO import StringIO
    
    # read data into dataframe
    data = StringIO("""USER_NAME,USER_ROLE,GMT_LOGIN_TIME,GMT_LOGOUT_TIME,LOGIN_DURATION 
    ABCD,SUP_235,2017-12-15 11:05:25,2017-12-15 11:09:01,216.0
    ABCE,SUP_235,2017-12-15 11:23:14,2017-12-15 11:33:17,603.0
    ABCF,USER_285,2017-12-15 11:44:12,2017-12-15 12:07:52,1420.0
    ABCG,SUP_230,2017-12-15 18:43:17,2017-12-15 19:00:20,1023.0
    ABCH,USER_245,2017-12-15 10:36:59,2017-12-15 11:42:00,3901.0""")
    df = pd.read_csv(data, sep=",")
    
    # create a new column for simultaneous
    df['simultaneous'] = 0
    
    # loop through dataframe and check condition
    for i in df.index:
        login, logout = df.loc[i,'GMT_LOGIN_TIME'], df.loc[i,'GMT_LOGOUT_TIME']
        this_index = df.index.isin([i])
        df.loc[i, 'simultaneous'] = int(any(
            (df[~this_index]['GMT_LOGIN_TIME'] <= logout) & (df[~this_index]['GMT_LOGOUT_TIME'] >= login)
        ))
    

    Output:

      USER_NAME USER_ROLE       GMT_LOGIN_TIME      GMT_LOGOUT_TIME  \
    0      ABCD   SUP_235  2017-12-15 11:05:25  2017-12-15 11:09:01   
    1      ABCE   SUP_235  2017-12-15 11:23:14  2017-12-15 11:33:17   
    2      ABCF  USER_285  2017-12-15 11:44:12  2017-12-15 12:07:52   
    3      ABCG   SUP_230  2017-12-15 18:43:17  2017-12-15 19:00:20   
    4      ABCH  USER_245  2017-12-15 10:36:59  2017-12-15 11:42:00   
    
       LOGIN_DURATION   simultaneous  
    0            216.0             1  
    1            603.0             1  
    2           1420.0             0  
    3           1023.0             0  
    4           3901.0             1  
    

    Which is correct because both ABCD and ABCE are simultaneous with ABCH.

    Explanation of the logic:

    1. We loop through each index (row) of the dataframe and grab the login and logout times for that user.

    2. Next we want to check all other rows for an overlap, so we create this_index as an indexer to point to the current row. Using the bitwise inversion operator (~), we can select the other rows with df[~this_index].

    3. For the other rows we check to see if any of them meet the overlap condition, using the any function. Since this returns a boolean, we convert to int. We could also have used sum(condition) if you wanted to see how many other connections were overlapping with this one.

    4. The result of int(any(condition)) is put in the 'simultaneous' column.

    Refer to this page for the logic behind the overlapping time ranges: Determine Whether Two Date Ranges Overlap

    Edit: A more pandas way

    Looping over rows in pandas is slow for large tables. Here is a way to achieve the same results using pandas.Dataframe.apply(), which should be faster.

    df['simultaneous'] = df.apply(
        lambda x: int(
            any(
                (df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGIN_TIME'] <= x['GMT_LOGOUT_TIME']) &\
                (df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGOUT_TIME'] >= x['GMT_LOGIN_TIME'])
            )
        ),
        axis=1
    )