Search code examples
pythonpandasdynamicmultiple-columnsmultipleselection

Dynamic comparison of values of n multiple Pandas columns


Let's say a user can input the columns and values to compare for a DF, so we can have:

column_list = ['col1', 'col2', 'col3']
value_list = [val1, val2, val3]

So to select the rows that satisfy where col1 >= val1 AND col2 >= val2 AND col3 >= val3 we would write:

selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

or it can be in the form:

selection  = df.loc[(df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3)]

The number of columns is not known in advance, so we can have n columns. We can try this approach:

if n=1:
   selection = (df['col1'] >= val1))
elif n=2:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2))
elif n=3:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

But this is neither scalable nor efficient. I tried by generating strings df['col<>'] >= val<>) with a for loop given the input lists but it didn't work for Pandas because of the str format.

What would be the best pythonic approach for this? To avoid having all the options with if and else statements.


Solution

  • To perform a comparison with the same operator for all columns, create a Series with the values and columns ids and use it to perform an aligned comparison with the dataframe:

    df[df.gt(pd.Series(value_list, index=column_list)).all(1)]
    

    Example input:

    >>> value_list
    [3, 7, 11]
    >>> df
       col1  col2  col3
    0     0     1     2
    1     3     4     5
    2     6     7     8
    3     9    10    11
    4    12    13    14
    

    output:

       col1  col2  col3
    4    12    13    14
    

    intermediates:

    >>> pd.Series(value_list, index=column_list)
    col1     3
    col2     7
    col3    11
    
    >>> df.gt(pd.Series(value_list, index=column_list))
        col1   col2   col3
    0  False  False  False
    1  False  False  False
    2   True  False  False
    3   True   True  False
    4   True   True   True
    
    >>> df.gt(pd.Series(value_list, index=column_list)).all(1)
    0    False
    1    False
    2    False
    3    False
    4     True