Given the following Pandas DataFrame, how can I get all the rows where only Test1 and Test2 failed? In my true dataset I have about 70 different tests and need an easy way to filter devices based on different tests they failed.
import pandas as pd
data = [['SN-01', 'Fail', 'Pass', 'Pass'],
['SN-02', 'Pass', 'Fail', 'Pass'],
['SN-03', 'Fail', 'Fail', 'Pass'],
['SN-04', 'Fail', 'Fail', 'Fail']]
df = pd.DataFrame(data, columns = ['Serial', 'Test1', 'Test2', 'Test3'])
Out[133]:
Serial Test1 Test2 Test3
0 SN-01 Fail Pass Pass
1 SN-02 Pass Fail Pass
2 SN-03 Fail Fail Pass
3 SN-04 Fail Fail Fail
If I try the following
df.loc[(df.Test1 == "Fail") & (df.Test2 == "Fail"), 'Serial']
Out[132]:
2 SN-03
3 SN-04
Name: Serial, dtype: object
I get SN-03 and SN-04 as output but the correct answer is only SN-03. Is there a way to create a boolean mask and apply to each row and filter them if they are a true match? What is the proper Pythonic/Panda way to get the results?
Update I found a solution, but it feel non Pythonic. Is there no built-in solution to filter rows by comparing each row with a mask?
#Replace pass/fail string with boolean, note fail is True here
df = df.replace({"Pass":False, "Fail":True})
#get all rows which have exactly 2 failed tests
mask = df.sum(axis = 1)==2
#use the mask to get subset of rows.
df2 = df[mask]
#now filter for 2 or more specific tests we care about.
z = df2.loc[(df2['Test1'] == True) & (df2['Test2'] == True), 'Serial' ]
Here is one way:
cols = ['Test1','Test2']
d = df.set_index('Serial').eq('Pass')
d[cols].sum(axis=1).eq(0) & d.sum(axis=1).eq(len(d.columns)-len(cols))
or
df[['Test1','Test2','Test3']].apply(tuple,axis=1).isin([('Fail', 'Fail', 'Pass')])
or
df[['Test1','Test2','Test3']].eq('Fail').sum(axis=1).eq(2) & df['Test3'].eq('Pass')