Search code examples
pythonpandasrowmulti-index

Row selection on multiple column criteria in muti-index dataframe (more efficient solution?)


I have A dataframe which consists of many 'ratio' columns (67 column: ratio_1 till ratio_67). Based on these columns I want to make a new column in my dataframe with an entry 'pass' or 'fail'

I get stuck on two things, to explain my df I made a simple example:

import random
col3=[0,0,0,0,2,4,6,0,0,0,100,200,300,400]
col4=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]
listt = [np.nan, np.nan , 1, 1, 1, 5, 5, 1, 1,np.nan ,np.nan ]

col5 = [listt[random.randrange(len(listt))]
              for item in range(14)]
col6 = [listt[random.randrange(len(listt))]
              for item in range(14)]
col7 = [listt[random.randrange(len(listt))]
              for item in range(14)]
col8 = [listt[random.randrange(len(listt))]
              for item in range(14)]


df = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6], 'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017], 'col3' : col3, 'col4' : col4, 'Ratio1' : col5, 'Ratio2' : col6, 'Ratio2' : col6, 'Ratio3' : col7, 'Ratio4' : col8}
df = pd.DataFrame(data=df)
df = df.groupby(['Unit', 'Year']).sum()

df

In this df i want to make a selection for the rows that have only 0 or 1 values in the ratio columns, should get a pass as comment. And rows that contain in one of the ratio columns some other entry (whatever it is) should get a fail comment.

I try it like this:

[((df_bk_t['Ratio1'] == float(0)) | (df_bk_t['Ratio1'] == float(1))) 
& ((df_bk_t['Ratio2'] == float(0)) | (df_bk_t['Ratio2'] == float(1))) 
& ((df_bk_t['Ratio3'] == float(0)) | (df_bk_t['Ratio3'] == float(1))) 
& ((df_bk_t['Ratio4'] == float(0)) | (df_bk_t['Ratio4'] == float(1)))]

But this method seems very inefficient when i would have to do it for 67 columns. Does anyone know how this can be done more efficient? And how can i make a new column with pass or fail, based on this? Is this method correct for this?

Thank you in advance for help/suggestions,

Jen


Solution

  • Use DataFrame.filter for columns with column name with Ratio, create boolean DataFrame by DataFrame.isin and test if all Trues per row by DataFrame.all, last set new columns values by numpy.where:

    df['new'] = np.where(df.filter(like='Ratio').isin([0,1]).all(axis=1), 'Fail', 'Pass')
    

    If performance is important use numpy solution:

    arr = df.filter(like='Ratio').values
    df['new'] = np.where(np.all(np.isin(arr, [0,1]), axis=1), 'Fail', 'Pass')
    

    print (df)
    
    Unit Year                                                  
    1    2014     0     0     1.0     1.0     0.0     5.0  Pass
         2015     0     0     1.0     0.0     1.0     1.0  Fail
         2016     0     0     1.0     5.0     1.0     0.0  Pass
         2017     0     0     0.0     1.0     1.0     5.0  Pass
    2    2015     2     4     5.0     0.0     0.0     1.0  Pass
         2016     4     6     1.0     1.0     1.0     5.0  Pass
         2017     6     8     0.0     1.0     5.0     5.0  Pass
    3    2017     0     0     0.0     0.0     0.0     5.0  Pass
    4    2014     0     0     0.0     1.0     0.0     1.0  Fail
    5    2015     0     0     1.0     1.0     5.0     1.0  Pass
    6    2014   100   200     0.0     5.0     0.0     0.0  Pass
         2015   200   900     0.0     1.0     1.0     1.0  Fail
         2016   300   400     5.0     1.0     1.0     0.0  Pass
         2017   400   500     0.0     0.0     1.0     0.0  Fail