Search code examples
pythonpandas

Pandas Change values of a dataframe based on an override


I have a pandas dataframe which looks something like this.

orig |  dest |  type | class |  BKT   | BKT_order | value   | fc_Cap | sc_Cap
-----+-------+-------+-------+--------+-----------+---------+--------+---------
 AMD |  TRY  |   SA  | fc    |  MA    |   1       |   12.04 |   20   |   50
 AMD |  TRY  |   SA  | fc    |  TY    |   2       |   11.5  |   20   |   50
 AMD |  TRY  |   SA  | fc    |  NY    |   3       |   17.7  |   20   |   50
 AMD |  TRY  |   SA  | fc    |  MU    |   4       |   09.7  |   20   |   50
 AMD |  TRY  |   PE  | fc    |  RE    |   1       |   09.7  |   20   |   50
 AMD |  TRY  |   PE  | sc    |  EW    |   5       |   07.7  |   20   |   50
 NCL |  MNK  |   PE  | sc    |  PO    |   2       |   08.7  |   20   |   50
 NCL |  MNK  |   PE  | sc    |  TU    |   3       |   12.5  |   20   |   50
 NCL |  MNK  |   PE  | sc    |  MA    |   1       |   16.7  |   20   |   50

Also i have an override Dataframe which may look something like this:

orig |  dest |  type |  max_BKT 
-----+-------+-------+-----------
 AMD |  TRY  |   SA  |  TY
 NCL |  MNK  |   PE  |  PO
 NCL |  AGZ  |   PE  |  PO

what i want to do is modify the original dataframe such that after comparison of orig dest type & BKT ( with max_BKT) values, the value column for any rows which have the BKT_order higher than or equal to the max_BKT in override DF is set to either fc_Cap or sc_Cap depending on the class value.

For Example in above scenario,

Since the Override DF sets max_BKT as TY for AMD | TRY | SA and the bucket order for TY is 2 in original Df, i need to set the value column equal to fc_Cap or sc_Cap depending on the value of class for all rows where BKT_order >= 2

So basically:

  • filter the rows for orig dest type combination
  • Get the BKT_order of max_BKT from the Original DF
  • for each row that matches the above criteria
    • if class == fc update value column with fc_Cap
    • if class == sc update value column with sc_Cap

So our original DF looks something like this:

orig |  dest |  type | class |  BKT   | BKT_order | value   | fc_Cap | sc_Cap
-----+-------+-------+-------+--------+-----------+---------+--------+---------
 AMD |  TRY  |   SA  | fc    |  MA    |   1       |   12.04 |   20   |   50
 AMD |  TRY  |   SA  | fc    |  TY    |   2       |   20    |   20   |   50
 AMD |  TRY  |   SA  | fc    |  NY    |   3       |   20    |   20   |   50
 AMD |  TRY  |   SA  | fc    |  MU    |   4       |   20    |   20   |   50
 AMD |  TRY  |   PE  | fc    |  RE    |   1       |   09.7  |   20   |   50
 AMD |  TRY  |   PE  | sc    |  EW    |   5       |   07.7  |   20   |   50
 NCL |  MNK  |   PE  | sc    |  PO    |   2       |   50    |   20   |   50
 NCL |  MNK  |   PE  | sc    |  TU    |   3       |   50    |   20   |   50
 NCL |  MNK  |   PE  | sc    |  MA    |   1       |   16.7  |   20   |   50

I have tried an approach to iterate over the override df and try to handle 1 row at a time but, i get stuck when i need to do a reverse lookup to get the BKT_order of the max_BKT from original df.

Hope that makes sense... i am fairly new to pandas.


Solution

  • That's a fairly complex task. The individual steps are straightforward though.

    You need:

    idx, cols = pd.factorize(df['class']+'_Cap')
    
    group = ['orig', 'dest', 'type']
    
    out = (
        df.merge(override, on=group, how='left')
        .assign(
            value=lambda x: x['value'].mask(
                x['BKT_order'].ge(
                    x['BKT_order']
                    .where(x['BKT'].eq(x['max_BKT']))
                    .groupby([x[c] for c in group])
                    .transform('first')
                ),
                x.reindex(cols, axis=1).to_numpy()[np.arange(len(x)), idx],
            )
        )
        .reindex(columns=df.columns)
    )
    

    Output:

      orig dest type class BKT  BKT_order  value  fc_Cap  sc_Cap
    0  AMD  TRY   SA    fc  MA          1  12.04      20      50
    1  AMD  TRY   SA    fc  TY          2  20.00      20      50
    2  AMD  TRY   SA    fc  NY          3  20.00      20      50
    3  AMD  TRY   SA    fc  MU          4  20.00      20      50
    4  AMD  TRY   PE    fc  RE          1   9.70      20      50
    5  AMD  TRY   PE    sc  EW          5   7.70      20      50
    6  NCL  MNK   PE    sc  PO          2  50.00      20      50
    7  NCL  MNK   PE    sc  TU          3  50.00      20      50
    8  NCL  MNK   PE    sc  MA          1  16.70      20      50
    

    Intermediates:

    tmp = df.merge(override, on=group, how='left')
    tmp['cap'] = tmp.reindex(cols, axis=1).to_numpy()[np.arange(len(tmp)), idx]
    tmp['mask'] = tmp['BKT'].eq(tmp['max_BKT'])
    tmp['masked_BKT'] = tmp['BKT_order'].where(tmp['mask'])
    tmp['ref_BKT'] = tmp.groupby(group)['masked_BKT'].transform('first')
    tmp['>= ref_BKT'] = tmp['BKT_order'].ge(tmp['ref_BKT'])
    
      orig dest type class BKT  BKT_order  value  fc_Cap  sc_Cap max_BKT  cap   mask  masked_BKT  ref_BKT  >= ref_BKT
    0  AMD  TRY   SA    fc  MA          1  12.04      20      50      TY   20  False         NaN      2.0       False
    1  AMD  TRY   SA    fc  TY          2  11.50      20      50      TY   20   True         2.0      2.0        True
    2  AMD  TRY   SA    fc  NY          3  17.70      20      50      TY   20  False         NaN      2.0        True
    3  AMD  TRY   SA    fc  MU          4   9.70      20      50      TY   20  False         NaN      2.0        True
    4  AMD  TRY   PE    fc  RE          1   9.70      20      50     NaN   20  False         NaN      NaN       False
    5  AMD  TRY   PE    sc  EW          5   7.70      20      50     NaN   50  False         NaN      NaN       False
    6  NCL  MNK   PE    sc  PO          2   8.70      20      50      PO   50   True         2.0      2.0        True
    7  NCL  MNK   PE    sc  TU          3  12.50      20      50      PO   50  False         NaN      2.0        True
    8  NCL  MNK   PE    sc  MA          1  16.70      20      50      PO   50  False         NaN      2.0       False