Tags: pandas, numpy, conditional-statements, mask

pandas: conditionally select a row cell for each column based on a mask


I want to extract values from a pandas DataFrame using a mask, but after searching around I cannot find a solution to my problem.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(2, 10)))
mask = np.random.randint(0, 2, size=(1, 10))

I basically want the mask to serve as a row-index lookup for each column.

So if the mask was [0,1] for columns [a,b], I want to return:

df.iloc[0,a], df.iloc[1,b]

but in a pythonic way.

I have tried e.g.:

df.apply(lambda x: df.iloc[mask[x], x] for x in range(len(mask)))

which raises a TypeError that I don't understand.

A for loop can work but is slow.


Solution

  • With NumPy, this is a case of advanced (integer array) indexing, which pairs each row index in the mask with its column position and should be pretty efficient:

    df.values[mask, np.arange(mask.size)]
    

    Sample run:

    In [59]: df = pd.DataFrame(np.random.randint(11,99, size=(5,10)))
    
    In [60]: mask = np.random.randint(0,5, size=(1,10))
    
    In [61]: df
    Out[61]: 
        0   1   2   3   4   5   6   7   8   9
    0  17  87  73  98  32  37  61  58  35  87
    1  52  64  17  79  20  19  89  88  19  24
    2  50  33  41  75  19  77  15  59  84  86
    3  69  13  88  78  46  76  33  79  27  22
    4  80  64  17  95  49  16  87  82  60  19
    
    In [62]: mask
    Out[62]: array([[2, 3, 0, 4, 2, 2, 4, 0, 0, 0]])
    
    In [63]: df.values[mask, np.arange(mask.size)]
    Out[63]: array([[50, 13, 73, 95, 19, 77, 87, 58, 35, 87]])
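    Because mask has shape (1, 10), the result above is also 2-D. If a flat result is preferred, a minimal sketch (reusing the numbers from the sample run; the names rows, cols, picked, and result are illustrative, and to_numpy() assumes a reasonably recent pandas) could flatten the mask first and check the output against the plain loop:

```python
import numpy as np
import pandas as pd

# Data copied from the sample run above.
df = pd.DataFrame([
    [17, 87, 73, 98, 32, 37, 61, 58, 35, 87],
    [52, 64, 17, 79, 20, 19, 89, 88, 19, 24],
    [50, 33, 41, 75, 19, 77, 15, 59, 84, 86],
    [69, 13, 88, 78, 46, 76, 33, 79, 27, 22],
    [80, 64, 17, 95, 49, 16, 87, 82, 60, 19],
])
mask = np.array([[2, 3, 0, 4, 2, 2, 4, 0, 0, 0]])

rows = mask.ravel()                  # one row index per column, shape (10,)
cols = np.arange(df.shape[1])        # column positions 0..9
picked = df.to_numpy()[rows, cols]   # pairs rows[i] with cols[i], shape (10,)

# Reference result from the slow per-column loop, for comparison.
expected = [df.iloc[r, c] for c, r in enumerate(rows)]
assert picked.tolist() == expected

# Optionally label the values with the original column names.
result = pd.Series(picked, index=df.columns)
print(result.tolist())
```

    Wrapping the values in a Series is optional; it only keeps the column labels attached to the selected cells.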