python, pandas, dataframe, dataset, data-wrangling

python - Reformat a dataset into cumulative string sequence


I'm using Pandas to wrangle dataframes. I have a dataset of user events, one row per event, as below:

userID  eventID  date        event
1       001      2020-01-01  A
1       002      2020-01-02  D
1       003      2020-01-03  C
2       004      2020-01-01  B

And I want to build a dataset that has each user's history as one attribute and the current event as another, as below:

userID  history  current_event
1       O        A
1       A        D
1       A-D      C
2       O        B

Where O means no history.

I came up with a for-loop-based approach, but it becomes impractical and very slow on larger datasets. The approach is as below:

import pandas as pd

inputdf = df.copy()
final_df = pd.DataFrame(columns=['userID', 'history', 'current_event'])

# for loop
for user in inputdf.userID.unique():  # loop through users
    user_events = inputdf[inputdf.userID == user].sort_values('date')
    for i in range(len(user_events)):  # loop through events of a user
        subset = user_events.iloc[:i + 1]
        if len(subset) == 1:  # first event for a user: no history yet
            appendrow = pd.DataFrame([{'userID': user, 'history': 'O',
                                       'current_event': subset['event'].iloc[0]}])
        else:  # later events: history is all earlier events joined with '-'
            appendrow = pd.DataFrame([{'userID': user,
                                       'history': '-'.join(subset['event'].iloc[:-1]),
                                       'current_event': subset['event'].iloc[-1]}])
        final_df = pd.concat([final_df, appendrow], axis=0, ignore_index=True)

final_df = final_df.drop_duplicates()
final_df.head()

Now I want to vectorize the operation. The problem I'm stuck on is that I need to build the cumulative history up to the second-to-last row for a user while at the same time getting the current row's event, and put the two together. I can't simply apply a function to every row, since it needs to run in the context of an individual user.

How can I vectorize the operation?


Solution

  • You can use itertools.accumulate in a groupby-apply.

    from itertools import accumulate
    
    df["date"] = pd.to_datetime(df["date"])
    df.sort_values(["userID", "date"], inplace=True)
    
    # Within each user, shift events down one row so each row sees only the
    # prior event, then build the running "-"-joined history with accumulate.
    t = (
        df.groupby("userID")["event"]
        .apply(lambda x: list(accumulate(x.shift().fillna(""), "{}-{}".format)))
        .reset_index()
        .explode("event")
    )
    # t is ordered by userID like the sorted df, so the values align row-wise;
    # strip the leading "-" left over from the empty first element.
    df["history"] = t["event"].str.strip("-").values
    
    print(df)
    
       userID  eventID       date event history
    0       1        1 2020-01-01     A        
    1       1        2 2020-01-02     D       A
    2       1        3 2020-01-03     C     A-D
    3       2        4 2020-01-01     B
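A quick note on how the accumulate step works: within each group, x.shift().fillna("") replaces every event with the previous one (empty for the first row), and accumulate(..., "{}-{}".format) then folds those into running prefixes, which .str.strip("-") cleans up. A minimal sketch of just that step, with a plain list standing in for one user's shifted events:

    from itertools import accumulate

    # one user's events after shift/fillna: "" (no prior event), then "A", "D"
    shifted = ["", "A", "D"]
    print(list(accumulate(shifted, "{}-{}".format)))  # ['', '-A', '-A-D']

If you want the literal O marker for "no history", as in your desired output, you can replace the empty strings afterwards with df["history"] = df["history"].replace("", "O").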