Search code examples
pythonpandasmultiple-columnsdataframemask

Masking multiple columns on a pandas dataframe in Python


I am looking to apply multiply masks on each column of a pandas dataset (respectively to its properties) in Python.

In the next step, I want to find (a) row(s) in the dataframe that fits all conditions.

Therefore I have:

df
Out[27]:
   DE  FL  GA  IA  ID
0   0   1   0   0   0
1   1   0   1   0   1
2   0   0   1   0   0
3   0   1   0   0   0
4   0   0   0   0   0

mask_list = []
for i in range(0,5):

    if i % 2==0:
        mask_list.append(df[[i]]>0)
    else:
        mask_list.append(df[[i]]<1)

concat_frame = pa.DataFrame()
for mask in mask_list:
    concat_frame =pa.concat((concat_frame, mask), axis=1)

concat_frame
Out[48]:
      DE     FL     GA    IA     ID
0  False   False False  True  False
1  True    True  True   True  True
2  False   True  True   True  False
3  False   False False  True  False
4  False   True  False  True  False

[5 rows x 5 columns]

Expected outcome:

Out[60]:
   DE   FL  GA  IA  ID
1   1   0   1   0   1

How can I apply the concat_mask on df, so that I select rows, in which all Boolean criteria are matched (are True)?


Solution

  • You can use the pandas all method and Boolean logic. As EdChum commented, I am a bit unclear still on your exact example, but a similar example is:

    In [1]: df = DataFrame([[1,2],[-3,5]], index=[0,1], columns=['a','b'])
    In [2]: df
    Out [2]:
       a  b
    0  1  2
    1 -3  5
    
    In [3]: msk = (df>1) & (df<5)
    In [4]: msk
    Out [4]:
          a    b
    0 False  True
    1 False False
    
    In [5]: msk.all(axis=1)
    Out [5]:
    0  False
    1  False
    dtype: bool
    

    If you wanted to index the original dataframe by the mask you could do:

    In [6]: df[msk]
    Out [6]:
         a   b
    0  NaN   2
    1  NaN NaN
    

    Or as you originally indicated, rows where all the rows are true:

    In [7]: idx = msk.all(axis=1)
    In [8]: df[idx]
    Out [8]:
    Empty DataFrame
    Columns: [a,b]
    Index: []
    

    Or if one row was true

    In [9]: idx[0] = True
    In [10]: df[idx]
    Out [10]:
      a b
    0 1 2
    

    For the original question after clarification from the comments, where we want different filtering criteria for different columns:

    In [10]: msk1 = df[['a']] < 0
    In [11]: msk2 = df[['b']] > 3
    In [12]: msk = concat((msk1, msk2), axis=1)
    In [12]: slct = msk.all(axis=1)
    In [13]: df.ix[slct]
    Out [13]:
       a b
    1 -3 5