python, pandas, dataframe, dataset, data-wrangling

python - Reformat a dataset into cumulative string sequence


I'm using Pandas to wrangle dataframes. I have a dataset of user events, one row per event, as below:

userID  eventID  date        event
1       001      2020-01-01  A
1       002      2020-01-02  D
1       003      2020-01-03  C
2       004      2020-01-01  B

And I want to build a dataset that has each user's history as one attribute and the current event as another, as below:

userID  history  current_event
1       O        A
1       A        D
1       A-D      C
2       O        B

Where O means no history.

I came up with a for-loop-based approach, but it becomes impractical and very slow on larger datasets. The approach is as below:

import pandas as pd

inputdf = df.copy()
final_df = pd.DataFrame(columns=['userID', 'history', 'current_event'])

# for loop
for user in inputdf.userID.unique():  # loop through users
    user_events = inputdf[inputdf.userID == user].sort_values('date')
    for i in range(len(user_events)):  # loop through events of a user
        subset = user_events.iloc[:i + 1]
        if len(subset) == 1:  # first event for a user: no history yet
            appendrow = pd.DataFrame([{'userID': user, 'history': 'O',
                                       'current_event': subset['event'].iloc[0]}])
        else:  # later events: history is all earlier events joined with '-'
            appendrow = pd.DataFrame([{'userID': user,
                                       'history': '-'.join(subset['event'].iloc[:-1]),
                                       'current_event': subset['event'].iloc[-1]}])
        final_df = pd.concat([final_df, appendrow], axis=0, ignore_index=True)

final_df = final_df.drop_duplicates()
final_df.head()

Now I want to vectorize the operation. The problem I'm stuck on is that I need to build the cumulative history up to the second-to-last row for a user while at the same time getting the current row's event, and put the two together. I can't simply apply a function to every row, since it needs to run in the context of an individual user.

How can I vectorize the operation?


Solution

  • You can use itertools.accumulate in a groupby-apply.

    from itertools import accumulate
    
    df["date"] = pd.to_datetime(df["date"])
    df.sort_values(["userID", "date"], inplace=True)
    
    # Within each user, shift events down one row so each row sees only the
    # prior event, then build the running "-"-joined history with accumulate.
    t = (
        df.groupby("userID")["event"]
        .apply(lambda x: list(accumulate(x.shift().fillna(""), "{}-{}".format)))
        .reset_index()
        .explode("event")
    )
    # t is ordered by userID like the sorted df, so the values align row-wise;
    # strip the leading "-" left over from the empty first element.
    df["history"] = t["event"].str.strip("-").values
    
    print(df)
    
       userID  eventID       date event history
    0       1        1 2020-01-01     A        
    1       1        2 2020-01-02     D       A
    2       1        3 2020-01-03     C     A-D
    3       2        4 2020-01-01     B
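A quick note on how the accumulate step works: within each group, x.shift().fillna("") replaces every event with the previous one (empty for the first row), and accumulate(..., "{}-{}".format) then folds those into running prefixes, which .str.strip("-") cleans up. A minimal sketch of just that step, with a plain list standing in for one user's shifted events:

    from itertools import accumulate

    # one user's events after shift/fillna: "" (no prior event), then "A", "D"
    shifted = ["", "A", "D"]
    print(list(accumulate(shifted, "{}-{}".format)))  # ['', '-A', '-A-D']

If you want the literal O marker for "no history", as in your desired output, you can replace the empty strings afterwards with df["history"] = df["history"].replace("", "O").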