Search code examples
pythonpandasdata-analysis

How to find rows of a DataFrame that matches a given set of conditions for all columns?


Given the following Pandas DataFrame, how can I get all the rows where only Test1 and Test2 failed? In my true dataset I have about 70 different tests and need an easy way to filter devices based on different tests they failed.

import pandas as pd

data = [['SN-01', 'Fail', 'Pass', 'Pass'], 
        ['SN-02', 'Pass', 'Fail', 'Pass'], 
        ['SN-03', 'Fail', 'Fail', 'Pass'],
        ['SN-04', 'Fail', 'Fail', 'Fail']]

df = pd.DataFrame(data, columns = ['Serial', 'Test1', 'Test2', 'Test3'])

Out[133]: 
  Serial Test1 Test2 Test3
0  SN-01  Fail  Pass  Pass
1  SN-02  Pass  Fail  Pass
2  SN-03  Fail  Fail  Pass
3  SN-04  Fail  Fail  Fail

If I try the following

df.loc[(df.Test1 == "Fail") & (df.Test2 == "Fail"), 'Serial']
Out[132]: 
2    SN-03
3    SN-04
Name: Serial, dtype: object

I get SN-03 and SN-04 as output but the correct answer is only SN-03. Is there a way to create a boolean mask and apply to each row and filter them if they are a true match? What is the proper Pythonic/Panda way to get the results?

Update I found a solution, but it feel non Pythonic. Is there no built-in solution to filter rows by comparing each row with a mask?

#Replace pass/fail string with boolean, note fail is True here
df = df.replace({"Pass":False, "Fail":True})
#get all rows which have exactly 2 failed tests
mask = df.sum(axis = 1)==2
#use the mask to get subset of rows.
df2 = df[mask]
#now filter for 2 or more specific tests we care about.    
z = df2.loc[(df2['Test1'] == True) & (df2['Test2'] == True), 'Serial' ]

Solution

  • Here is one way:

    cols = ['Test1','Test2']
    d = df.set_index('Serial').eq('Pass')
    d[cols].sum(axis=1).eq(0) & d.sum(axis=1).eq(len(d.columns)-len(cols))
    

    or

    df[['Test1','Test2','Test3']].apply(tuple,axis=1).isin([('Fail', 'Fail', 'Pass')])
    

    or

    df[['Test1','Test2','Test3']].eq('Fail').sum(axis=1).eq(2) & df['Test3'].eq('Pass')