Pandas conditional group by min()


I am trying to get the earliest value of a date variable for the rows where the principal balance is at or below 5% of the disbursement amount. I want this computed per account number, but added as a column to the original df rather than as a new df grouped by account number.

My df looks like this:

| account_number | period_date | principal_balance_amt | disbursement_amt |
| -------------: | ----------- | --------------------- | ---------------- |
| 1              | 2021-01-01  | 10                    | 100              |
| 1              | 2021-02-01  | 6                     | 100              |
| 1              | 2021-03-01  | 3                     | 100              |
| 1              | 2021-04-01  | 0                     | 100              |
| 2              | 2021-01-01  | 20                    | 100              |
| 2              | 2021-02-01  | 15                    | 100              |
| 2              | 2021-03-01  | 11                    | 100              |
| 2              | 2021-04-01  | 8                     | 100              |

I have tried code similar to this, but it just returns an invalid syntax error.

df['churn_date'] = df.loc[groupby('account_number').(df['principal_balance_amt'] <= 0.05 * df['disbursement_amt']), 'period_date'].min()

I want the code to create a df that looks like this:

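| account_number | period_date | principal_balance_amt | disbursement_amt | churn_date |
| -------------: | ----------- | --------------------- | ---------------- | ---------- |
| 1              | 2021-01-01  | 10                    | 100              | 2021-03-01 |
| 1              | 2021-02-01  | 6                     | 100              | 2021-03-01 |
| 1              | 2021-03-01  | 3                     | 100              | 2021-03-01 |
| 1              | 2021-04-01  | 0                     | 100              | 2021-03-01 |
| 2              | 2021-01-01  | 20                    | 100              | nan        |
| 2              | 2021-02-01  | 15                    | 100              | nan        |
| 2              | 2021-03-01  | 11                    | 100              | nan        |
| 2              | 2021-04-01  | 8                     | 100              | nan        |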

Solution

  • Use Series.where to replace period_date with a missing value on rows that do not match the condition, then use GroupBy.transform with min to fill the new column with the earliest matching date per account:

    mask = (df['principal_balance_amt'] <= 0.05 * df['disbursement_amt'])
    df['churn_date'] = (df.assign(new = df['period_date'].where(mask))
                          .groupby('account_number')['new']
                          .transform('min'))
    
    print (df)
       account_number period_date  principal_balance_amt  disbursement_amt  \
    0               1  2021-01-01                     10               100   
    1               1  2021-02-01                      6               100   
    2               1  2021-03-01                      3               100   
    3               1  2021-04-01                      0               100   
    4               2  2021-01-01                     20               100   
    5               2  2021-02-01                     15               100   
    6               2  2021-03-01                     11               100   
    7               2  2021-04-01                      8               100   
    
      churn_date  
    0 2021-03-01  
    1 2021-03-01  
    2 2021-03-01  
    3 2021-03-01  
    4        NaT  
    5        NaT  
    6        NaT  
    7        NaT  
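
    The NaT values in the output suggest that period_date holds datetimes rather than strings. The question does not show the column dtype, so the following is only a minimal, self-contained sketch that assumes the dates are parsed with pd.to_datetime. It rebuilds the sample data and groups the masked dates directly instead of going through a temporary column (the name masked_dates is just illustrative):

```python
import pandas as pd

# Sample data from the question; parsing the dates is an assumption,
# since the original post does not show the column dtype.
df = pd.DataFrame({
    "account_number": [1, 1, 1, 1, 2, 2, 2, 2],
    "period_date": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"] * 2
    ),
    "principal_balance_amt": [10, 6, 3, 0, 20, 15, 11, 8],
    "disbursement_amt": [100] * 8,
})

# True where the balance is at or below 5% of the disbursement amount.
mask = df["principal_balance_amt"] <= 0.05 * df["disbursement_amt"]

# Keep the date only on matching rows; all other rows become NaT.
masked_dates = df["period_date"].where(mask)

# Earliest matching date per account, broadcast back to every row.
df["churn_date"] = masked_dates.groupby(df["account_number"]).transform("min")
```

    Because transform returns a result aligned to the original index, the frame keeps all eight rows, and accounts with no matching row end up with a missing churn_date.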
    

    Alternatively, filter the matching rows with boolean indexing, aggregate the min of period_date per account, and map the result back onto account_number with Series.map:

    mask = (df['principal_balance_amt'] <= 0.05 * df['disbursement_amt'])
    s = df[mask].groupby('account_number')['period_date'].min()
    
    df['churn_date'] = df['account_number'].map(s)
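
    For reference, here is a self-contained sketch of the mapping approach on the sample data. It assumes period_date has been parsed with pd.to_datetime, which the question does not state, and the name first_hit is only illustrative:

```python
import pandas as pd

# Sample data from the question; parsing the dates is an assumption.
df = pd.DataFrame({
    "account_number": [1, 1, 1, 1, 2, 2, 2, 2],
    "period_date": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"] * 2
    ),
    "principal_balance_amt": [10, 6, 3, 0, 20, 15, 11, 8],
    "disbursement_amt": [100] * 8,
})

mask = df["principal_balance_amt"] <= 0.05 * df["disbursement_amt"]

# One row per account that has at least one matching row:
# index = account_number, value = earliest matching period_date.
first_hit = df[mask].groupby("account_number")["period_date"].min()

# Look up each row's account in first_hit; accounts that are absent
# from its index (here account 2) get a missing value.
df["churn_date"] = df["account_number"].map(first_hit)
```

    Here the aggregation only sees the filtered rows, so an account without any match never appears in first_hit and the lookup leaves its churn_date missing.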