The table below shows the login and logout time of a given software.
USER_NAME USER_ROLE GMT_LOGIN_TIME GMT_LOGOUT_TIME LOGIN_DURATION
ABCD SUP_235 2017-12-15 11:05:25 2017-12-15 11:09:01 216.0
ABCE SUP_235 2017-12-15 11:23:14 2017-12-15 11:33:17 603.0
ABCF USER_285 2017-12-15 11:44:12 2017-12-15 12:07:52 1420.0
ABCG SUP_230 2017-12-15 18:43:17 2017-12-15 19:00:20 1023.0
ABCH USER_245 2017-12-15 10:36:59 2017-12-15 11:42:00 3901.0
How can I identify simultaneous users? I'll have to do this in Python but I'm not worrying about the implementation in that programming language. I'd like to understand the logic itself.
In the example above users ABCD, ABCE and ABCH are the only simultaneous.
Here is a way to do it using pandas
:
from StringIO import StringIO
# read data into dataframe
data = StringIO("""USER_NAME,USER_ROLE,GMT_LOGIN_TIME,GMT_LOGOUT_TIME,LOGIN_DURATION
ABCD,SUP_235,2017-12-15 11:05:25,2017-12-15 11:09:01,216.0
ABCE,SUP_235,2017-12-15 11:23:14,2017-12-15 11:33:17,603.0
ABCF,USER_285,2017-12-15 11:44:12,2017-12-15 12:07:52,1420.0
ABCG,SUP_230,2017-12-15 18:43:17,2017-12-15 19:00:20,1023.0
ABCH,USER_245,2017-12-15 10:36:59,2017-12-15 11:42:00,3901.0""")
df = pd.read_csv(data, sep=",")
# create a new column for simultaneous
df['simultaneous'] = 0
# loop through dataframe and check condition
for i in df.index:
login, logout = df.loc[i,'GMT_LOGIN_TIME'], df.loc[i,'GMT_LOGOUT_TIME']
this_index = df.index.isin([i])
df.loc[i, 'simultaneous'] = int(any(
(df[~this_index]['GMT_LOGIN_TIME'] <= logout) & (df[~this_index]['GMT_LOGOUT_TIME'] >= login)
))
Output:
USER_NAME USER_ROLE GMT_LOGIN_TIME GMT_LOGOUT_TIME \
0 ABCD SUP_235 2017-12-15 11:05:25 2017-12-15 11:09:01
1 ABCE SUP_235 2017-12-15 11:23:14 2017-12-15 11:33:17
2 ABCF USER_285 2017-12-15 11:44:12 2017-12-15 12:07:52
3 ABCG SUP_230 2017-12-15 18:43:17 2017-12-15 19:00:20
4 ABCH USER_245 2017-12-15 10:36:59 2017-12-15 11:42:00
LOGIN_DURATION simultaneous
0 216.0 1
1 603.0 1
2 1420.0 0
3 1023.0 0
4 3901.0 1
Which is correct because both ABCD
and ABCE
are simultaneous with ABCH
.
Explanation of the logic:
We loop through each index
(row) of the dataframe and grab the login
and logout
times for that user.
Next we want to check all other rows for an overlap, so we create this_index
as an indexer to point to the current row. Using the bitwise inversion operator (~
), we can select the other rows with df[~this_index]
.
For the other rows we check to see if any of them meet the overlap condition, using the any
function. Since this returns a boolean
, we convert to int
. We could also have used sum(condition)
if you wanted to see how many other connections were overlapping with this one.
The result of int(any(condition))
is put in the 'simultaneous'
column.
Refer to this page for the logic behind the overlapping time ranges: Determine Whether Two Date Ranges Overlap
Looping over rows in pandas is slow for large tables. Here is a way to achieve the same results using pandas.Dataframe.apply()
, which should be faster.
df['simultaneous'] = df.apply(
lambda x: int(
any(
(df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGIN_TIME'] <= x['GMT_LOGOUT_TIME']) &\
(df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGOUT_TIME'] >= x['GMT_LOGIN_TIME'])
)
),
axis=1
)