python pandas loops

How to avoid a loop in Pandas when iterating over unique values?


I have a table with users and times of their actions:

user_id  time  user_action
user_1   1     action_1
user_2   2     action_2
user_1   3     action_3
user_2   4     action_4

My algorithm loops over the unique user IDs and, for each user, performs some operations using the shift function.


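import pandas as pd

users = df.user_id.unique()

df_2 = pd.DataFrame(columns = ...)  # empty DataFrame

for user in users:
    # Take a copy of this user's rows so the assignment below does not
    # write into a view of df
    df_new = df[df.user_id == user].copy()
    # Shift within this user's rows only (df_new, not the full df)
    df_new['...'] = df_new['user_action'].shift(1)....
    # ... some actions using shift ...

    df_2 = pd.concat([df_2, df_new])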

Cumulative sums are also used, so it seems impossible to implement the algorithm without splitting the data into per-user "df_new" pieces.

This method takes too long, seemingly because of the Python "for" loop. How can I do it using built-in Pandas functions?


Solution

  • Without seeing the full code, it's hard to tell exactly what you're trying to achieve.

    But have you looked at DataFrame.groupby()? This returns a DataFrameGroupBy object which also offers shift() and cumsum() methods.

    Thus, you may be able to rewrite your code using groupby():

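    # shift() returns an ungrouped result, so group again before cumsum();
    # "time" stands in for whichever numeric column you need
    df.groupby(by="user_id")["time"].shift(1).groupby(df["user_id"]).cumsum()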
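
As a minimal sketch of this idea on the sample table from the question: the new column names (`prev_action`, `time_since_prev`, `cum_time`) are made up for illustration, and I am only guessing that these are the kinds of quantities you compute. Each call works within a user's rows and returns a result aligned with the original index, so no loop or `concat` is needed.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "user_id": ["user_1", "user_2", "user_1", "user_2"],
    "time": [1, 2, 3, 4],
    "user_action": ["action_1", "action_2", "action_3", "action_4"],
})

# Previous action of the same user (NaN on each user's first row).
prev_action = df.groupby("user_id")["user_action"].shift(1)

# Time since the same user's previous action (hypothetical quantity).
time_since_prev = df["time"] - df.groupby("user_id")["time"].shift(1)

# Running total of "time", computed separately for each user.
cum_time = df.groupby("user_id")["time"].cumsum()

# All three results share df's index, so they can be assigned directly.
df["prev_action"] = prev_action
df["time_since_prev"] = time_since_prev
df["cum_time"] = cum_time

print(df)
```

Because the results keep the original row order, the output frame does not have to be rebuilt group by group.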
    

    Another option could be DataFrameGroupBy.apply(), transform() or aggregate(), depending on your expected output format, applying your original logic to each group:

    def your_actions(df: pd.DataFrame) -> pd.DataFrame:
      ...  # whatever you did for each user
    
    df.groupby(by="user_id").apply(your_actions)
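
If the per-user logic produces one value per row, `transform()` may be the simplest of these options. The sketch below assumes a hypothetical per-user function, `shifted_running_total`, that combines `shift` and `cumsum`. The last two rows of the sample data are made up so that each user has three rows.

```python
import pandas as pd

# Rows from the question plus two made-up rows (time 5 and 6).
df = pd.DataFrame({
    "user_id": ["user_1", "user_2", "user_1", "user_2", "user_1", "user_2"],
    "time": [1, 2, 3, 4, 5, 6],
    "user_action": [f"action_{i}" for i in range(1, 7)],
})


def shifted_running_total(times: pd.Series) -> pd.Series:
    """Hypothetical per-user logic: running total of all earlier times."""
    return times.shift(1).cumsum()


# transform() calls the function once per user and stitches the pieces
# back together in the original row order.
df["prev_time_total"] = (
    df.groupby("user_id")["time"].transform(shifted_running_total)
)

print(df)
```

The function receives only one user's `time` values at a time, so the `shift` and `cumsum` inside it cannot leak across users, which is what the original loop achieved by filtering.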
    

    You may get more precise answers if you provide a more complete code sample or a description of what you are trying to compute using shift and cumsum, but I hope this already helps by pointing in a possible direction.