Search code examples
pythonpandasgroup-bymultiple-columnsupdates

Column Update from multiple columns and column headers in Pandas


I have a specific problem ( not sure if its very challenging to the pros and experts here, but it seems pretty formidable to me) to fix by updating a column based on some conditions in a column headers and values in a column:

I am providing some specific rows of the input Dataframe as an example:

   df-in
       
        A     B    ind     M1P   M2P   M3P   M4P   M5P
        
        x     a     2       0     0     3     5     9
        y     b     2      Nan    Nan   Nan   7     11
        z     c     2      0      Nan   0     3     3
        w     d     2      0      0     0     Nan   8
        u     q     2      0      0     0     Nan   0

So now, based on the value of the column 'ind' I need to check the column Mx ( where x can be 1,2,3,4,5). In the above example since all values in ind column are 2, I need to check M2P column and above ( I do not care about M1 column, However if ind was 1 I had to check M1 column). Now in this example if M2P column is 0, nan or blank, it gets the value from M3P, if M3P is also blank, 0, or null, it takes value from M4P. If M4P is also blank, null or 0, it gets the value from M5P, however if M5P value is blank/0/nan, then the value in M2P remains as it is ( the same logic needs to be created if ind is 1,2,3, or 5, that is if ind is 5, then it does not look anywhere else)

So the output of the above should be:

    df-out    
    
        A     B    ind     M1P   M2P   M3P   M4P   M5P
        
        x     a     2       0     3     3     5     9
        y     b     2      Nan    7    Nan    7     11
        z     c     2      0      3     0     3     3
        w     d     2      0      8     0     Nan   8
        u     q     2      0      0     0     Nan   0

I am still struggling to figure what should be the best way to attack this problem in pandas. Not able to understand yet. Any help/codes and ideas will be immensely appreciated.


Solution

  • Use:

    #get DataFrame with M + number + P pattern
    df1 = df.filter(regex='M\d+P')
    #extract numbers and convert to integers
    cols = df1.columns.str.extract('(\d+)', expand=False).astype(int)
    
    #compare by ind column
    m = cols.to_numpy() == df['ind'].to_numpy()[:, None]
    
    #update back filled missing values - only masked rows 
    df.update(df1.replace(['Nan', 0, '0'], np.nan).bfill(axis=1).where(m))
    print (df)
       A  B  ind  M1P  M2P  M3P  M4P  M5P
    0  x  a    2    0    3    3    5    9
    1  y  b    2  Nan    7  Nan    7   11
    2  z  c    2    0    3    0    3    3
    3  w  d    2    0  8.0    0  Nan    8
    4  u  q    2    0    0    0  Nan    0