I'm using Pandas to wrangle dataframes. I have a dataset of user events, with one row per event, as below:
userID | eventID | date | event |
---|---|---|---|
1 | 001 | 2020-01-01 | A |
1 | 002 | 2020-01-02 | D |
1 | 003 | 2020-01-03 | C |
2 | 004 | 2020-01-01 | B |
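(If you want to reproduce this, here's a minimal snippet to build the sample frame; the dtypes are an assumption, and eventID is kept as a plain integer, so the leading zeros are dropped:)

import pandas as pd

# Sample input: one row per user event (dtypes assumed)
df = pd.DataFrame({
    "userID": [1, 1, 1, 2],
    "eventID": [1, 2, 3, 4],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01"],
    "event": ["A", "D", "C", "B"],
})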
I want to build a dataset that has each user's history as one attribute and the current event as another, as below:
userID | history | current_event |
---|---|---|
1 | O | A |
1 | A | D |
1 | A-D | C |
2 | O | B |
Here O means no history.
I came up with a for-loop-based approach, but it becomes impractical and very slow when I run it on the larger dataset. The approach is as below:
inputdf = df.copy()
final_df = pd.DataFrame(columns=['userID', 'history', 'next_event'])
# for loop
for user in inputdf.userID.unique():  # Loop through users
    for i in range(4):  # Loop through events of a user
        subset = inputdf[inputdf.userID == user].sort_values('date').iloc[:i + 1]
        if len(subset) == 1:  # if first row for a user
            appendrow = pd.DataFrame([{'userID': user, 'history': 'O', 'next_event': subset['event'].iloc[0]}])
            final_df = pd.concat([final_df, appendrow], axis=0, ignore_index=True)
        else:  # if subsequent row for a user
            grouped = inputdf.sort_values(['userID', 'date']).iloc[:i].groupby(['userID']).agg({'event': '-'.join}).reset_index()
            appendrow = pd.DataFrame([{'userID': user, 'history': grouped.event.iloc[0], 'next_event': subset['event'].iloc[-1]}])
            final_df = pd.concat([final_df, appendrow], axis=0, ignore_index=True)
final_df = final_df.drop_duplicates()
final_df.head()
Now I want to vectorize the operation. The problem I'm stuck on is that I need to build the grouped history up to the second-to-last row for each user and, at the same time, get the current row's event and put the two together. I can't seem to apply a function to every row, because it needs to run in the context of an individual user.
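For what it's worth, the per-user "previous event" part on its own is easy with a grouped shift (just a sketch against the sample data above); what I can't see is how to accumulate it into the running history:

# Previous event within each user, NaN for a user's first event; nothing leaks across users
prev_event = df.sort_values(["userID", "date"]).groupby("userID")["event"].shift()
print(prev_event)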
How can I vectorize the operation?
You can use itertools.accumulate inside a groupby-apply.
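To see the idea in isolation: accumulate yields running reductions, so feeding it the events shifted down by one row (with "" standing in for "no previous event") plus a two-argument format function builds the joined history step by step. A quick illustration with user 1's events:

from itertools import accumulate

# Shifted events for user 1 are "", "A", "D"; accumulate keeps joining with '-'
list(accumulate(["", "A", "D"], "{}-{}".format))
# ['', '-A', '-A-D']  -> stripping the leading '-' gives '', 'A', 'A-D'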
from itertools import accumulate
df["date"] = pd.to_datetime(df["date"])
df.sort_values(["userID", "date"], inplace=True)
t = (
    df.groupby("userID")["event"]
    # Within each user: shift events down one row ("" for the first event),
    # then build the running '-'-joined history with accumulate.
    .apply(lambda x: list(accumulate(x.shift().fillna(""), "{}-{}".format)))
    .reset_index()
    .explode("event")  # one row per accumulated history entry, in the sorted row order
)
# Strip the leading '-' left over from formatting against the initial "" placeholder.
df["history"] = t["event"].str.strip("-").values
print(df)
   userID  eventID       date event history
0       1        1 2020-01-01     A
1       1        2 2020-01-02     D       A
2       1        3 2020-01-03     C     A-D
3       2        4 2020-01-01     B
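If you also want the literal O placeholder from your target output and a frame shaped like it, one extra step should do it (a sketch; current_event is just the existing event column renamed):

# Replace the empty "no history" marker with the 'O' placeholder
df["history"] = df["history"].replace("", "O")

# Shape the result like the target: userID | history | current_event
result = df[["userID", "history", "event"]].rename(columns={"event": "current_event"})
print(result)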