Search code examples
pythonrif-statementpandas-groupby

Python Group By and Case When Equivalent Relative to R


I am trying to do a case when/if-else statement in Python on a grouped dataframe to create a new variable. I would want to do the following if I was coding in R and I am trying find an operation in Python that's similar and also vectorized. R code:

dt %>% group_by(user,merchant,date) %>%
mutate(
new_variable = case_when(-amount == lag(amount) ~ 2,
                         True ~ 1)
) %>% ungroup()

In Python I have tried using np.select:

    conditions = [
    (-us_trans['real_amount'] == us_trans['real_amount'].shift(-1)),
    (-us_trans['real_amount'] != us_trans['real_amount'].shift(-1))

]

    values = [
        2, 
        1
        
    ]

But I don't know how to use np.select on a grouped dataframe to create a new variable.

I know that I can use groupby(['user','merchant','date'].apply and pass an if-else statement but I believe this would be done in a loop and I am trying to do it in a vectorized manner to optimize my code.

Thanks!


Solution

  • Using the slow pandas option:

    df["new_variable"] = np.where(df.groupby(['user', 'merchant','date'])['amount'].apply(lambda g: g.shift(-1)==-g),2,1)
    

    However, using datatable, along with shift(), ifelse() and by() will be MUCH faster

    from datatable import dt, f, by
    
    df = dt.Frame(df)
    
    df[:,
       dt.update(new_variable=dt.ifelse(-1*dt.shift(f.amount)==f.amount,2,1)),
       by(f.user,f.merchant,f.date)
    ]