Search code examples
pythonpandasdataframeduplicates

Python data clean in adjacent columns


I have a dataframe where some of the values in the 'Type' column are incorrect:

ID Type Other Data
ABC 10.0 56
ABC 10.0 34
ABC 0.0 78
DEF 20.0 45
DEF 0.0 23
ZWS 20.0 99
TYH 10.0 12
TYH 10.0 63
FCE 0.0 56
FCE 20.0 78
XSA 10.0 74
RED 0.0 55
QWA 20.0 74

I would like to go through each ID and identify if the Type is equal to 0.0, then replace it with a value from the a row where the same id has a value other than 0.0. If there is no other row with the ID, and it is 0.0 in the type column then change it to Nan, or 99 so it can be removed later.

The outcome would be: ABC on 3rd row changes to 10.0 DEF on 5th row changes to 20.0 FCE on 9th row changes to 20.0 RED on 12th row changes to Nan or 99 as no equivalent match found.

What would be the best approach for this problem?

I have tried building dictionary and looping, but I keep getting stuck at the replace 0.0 for values of equivalent key.


Solution

  • You can use a boolean mask to hide rows equal to 0 then group by Type and broadcast the max value (any other value than NaN) to all rows:

    m = df['Type'].eq(0)  # boolean mask
    df['New Type'] = df['Type'].mask(m).groupby(df['ID']).transform('max')
    print(df)
    
    # Output
         ID  Type  Other Data  New Type
    0   ABC  10.0          56      10.0
    1   ABC  10.0          34      10.0
    2   ABC   0.0          78      10.0  # 0.0 -> 10.0
    3   DEF  20.0          45      20.0
    4   DEF   0.0          23      20.0  # 0.0 -> 20.0
    5   ZWS  20.0          99      20.0
    6   TYH  10.0          12      10.0
    7   TYH  10.0          63      10.0
    8   FCE   0.0          56      20.0  # 0.0 -> 20.0
    9   FCE  20.0          78      20.0
    10  XSA  10.0          74      10.0
    11  RED   0.0          55       NaN  # 0.0 -> NaN
    12  QWA  20.0          74      20.0