i want to see if there is an efficient way to reshape dataframe that is read from a tab separated csv file. the data consist of event codes and messages that are stacked up vertically with their timestamps. the data include a state column that specify whether an event occured (TRUE) or cleared (FALSE). i tried iterating through each row and update accordingly, but it is taking forever to complete.
every event code can occur and clear multiple times. every occur event (TRUE) is followed by clear event (FALSE) for every EventCode.
below example shows the format of the input file:
Timestamp State Data EventCode EventMsg Class
19-May-2023 16:10:09.301 FALSE 1 EventCode 1 EventCode 1 Message class 1
19-May-2023 16:10:09.300 FALSE 2 EventCode 2 EventCode 2 Message class 1
19-May-2023 16:10:09.299 TRUE 3 EventCode 1 EventCode 1 Message class 2
19-May-2023 16:10:09.298 FALSE 4 EventCode 4 EventCode 4 Message class 2
19-May-2023 16:10:09.297 FALSE 5 EventCode 3 EventCode 3 Message class 2
19-May-2023 16:10:09.296 TRUE 6 EventCode 2 EventCode 2 Message class 1
19-May-2023 16:10:09.295 TRUE 7 EventCode 4 EventCode 4 Message class 2
19-May-2023 16:10:09.294 TRUE 8 EventCode 3 EventCode 3 Message class 2
19-May-2023 16:10:09.293 FALSE 0 EventCode 1 EventCode 1 Message class 2
19-May-2023 16:10:09.292 TRUE 9 EventCode 1 EventCode 1 Message class 2
below shows desired final format:
OccurTimestamp clearTimestamp Data EventCode EventMsg Class
19-05-2023 16:10:09.299 19-05-2023 16:10:09.301 3 EventCode 1 EventCode 1 Message class 1
19-05-2023 16:10:09.296 19-05-2023 16:10:09.300 6 EventCode 2 EventCode 2 Message class 1
19-05-2023 16:10:09.295 19-05-2023 16:10:09.298 7 EventCode 4 EventCode 3 Message class 2
19-05-2023 16:10:09.294 19-05-2023 16:10:09.297 8 EventCode 3 EventCode 4 Message class 2
19-05-2023 16:10:09.292 19-05-2023 16:10:09.293 9 EventCode 1 EventCode 1 Message class 1
Based on your example, I assume that each EventCode only exists exactly two times (once with State = True
and once with state = False
). Then this should work:
# sort so that values with state = True are first
# then group the values based on EventCode
# then only get the first rows for each group (those with state = True)
# then reset the index to get a normal dataframe back
# and rename the Timestamp column to OccurTimestamp
new_df = df.sort_values(by = 'State', ascending=False) \
.groupby('EventCode', group_keys=False) \
.first() \
.reset_index() \
.rename(columns={'Timestamp': 'OccurTimestamp'})
# now we just need the corresponding clearTimetamps:
clear_timestamps = df[df['State'] == False][['Timestamp', 'EventCode']].rename(columns = {'Timestamp': 'clearTimestamp'})
# and merge both dataframes based on the EventCode
final = pd.merge(new_df, clear_timestamps, on = 'EventCode')
Edit: Based on your additional info, I think you still need to iterate through the events as your problem inherently requires to check following events for a clear event.
You can still use
grouped = df.groupby('EventCode')
as each group then only consists of the individual EventCodes preserving the order of your original dataframe. This avoids iterating over all other events when trying to find a corresponding clear event.
You can then iterate over each group via
for name, group in grouped:
...