I have a view in Redshift that I'm reading from to create a data frame. The table is structured like the following with about 49k records:
session_id | timestamp | event_text |
---|---|---|
session1 | 2020-07-07 06:45:45.012 | event-A |
session1 | 2020-07-10 04:19:07.477 | event-B |
session2 | 2020-07-10 16:42:24.46 | event-B |
session2 | 2020-07-10 18:57:12.358 | event-C |
session3 | 2020-07-10 16:42:24.46 | event-A |
session3 | 2020-07-10 18:57:12.358 | event-C |
My goal was to create a data frame structured like this:
session_id | event-A | event-B | event-C |
---|---|---|---|
session1 | 1 | 1 | 0 |
session2 | 0 | 1 | 1 |
session3 | 1 | 0 | 1 |
I know this structure as a "truth table" but not sure what other people call it.
I found a way to do this in Python by looping through the query results, like this:
import pandas as pd
import pandas.io.sql as sqlio
# Redshift query
df = sqlio.read_sql_query(master_order_event_view, conn)
events = df.event_text.unique()
unique_sessions = df.session_id.unique()
# Creating Dataframe with session IDs as index and event_text values as columns
truth_df = pd.DataFrame(0, index=unique_sessions, columns=events)
for session_id, event_text in zip(df["session_id"], df["event_text"]):
truth_df.at[session_id, event_text] = 1
return truth_df
My question is: is there a built-in Pandas command/library to do this? I looked around the docs a lot but I couldn't find anything that satisfied it. Perhaps calling it a "truth table" is my problem.
Whether or not there's a built-in way to do this, does anyone have any suggestions of how to make this more efficient? Restructuring the view, using a different library, etc.
Thanks!
Big shoutout to @Dani Mesejo and @sammywemmy who gave me some suggestions.
for session_id, event_text in zip(df["session_id"], df["event_text"]):
truth_df.at[session_id, event_text] = 1
can be replaced by either of the following:
Using crosstab:
truth_df = pd.crosstab(df.session_id, df.event_text)
or groupby:
truth_df = df.groupby(["session_id", "event_text"]).size().unstack(fill_value=0)
Both work great but according to @sammywemmy , groupby is a bit speedier.