Tags: python, pandas, list-comprehension, boolean-expression, dictionary-comprehension

Write a Boolean expression of arbitrary size from a list or dictionary


I am dealing with pandas dataframes that have a uniform labelling schema for the columns but an arbitrary number of columns.

For example, take the following df with columns col_names, a special subset of those columns, and filter criteria corresponding to the special columns. They're linked by index, or with a dictionary; take your pick.

col_names = ['col1','col2','col3','xyz.1','xyz.2','xyz.3']
relevant_col_names = ['xyz.1', 'xyz.2', 'xyz.3']
filter_criteria = [something1, something2, something3]

I want to get the subset dataframe where 'xyz.1' == something1 & 'xyz.2' == something2 & 'xyz.3' == something3.

I would normally do this by:

whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3)]

The issue is I need to be able to write that expression for an arbitrary number of 'xyz' columns without having to manually code the above expression.

For example, if the df happens to have 5 relevant columns like this:

['col1','col2','col3','xyz.1','xyz.2','xyz.3','xyz.4','xyz.5']

Then I would need to automatically write something like this:

whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3) & (df['xyz.4'] == something4) & (df['xyz.5'] == something5)]

Is there a way to write a boolean expression of arbitrary length based on a list or a dictionary (or something else), where I'm ANDing everything inside of it?

I don't know how to phrase this question for Google. It seems related to list and dictionary comprehension, expression synthesis, or something else. Even knowing the right way to phrase this question, how to tag it for Stack Overflow, and/or what I should google would be very helpful. Thanks!


Solution

  • Assuming you have all the "somethings" in a list, you can do this:

    import functools
    import operator
    import pandas as pd
    
    df = pd.DataFrame.from_dict({
        "abc": (1, 2, 3),
        "xyz.1": (4, 2, 7),
        "xyz.2": (8, 5, 5)
    })
    targets = [2, 5]
    
    # One Boolean mask per target: df["xyz.1"] == 2, then df["xyz.2"] == 5
    filtered_dfs = (df[f"xyz.{index + 1}"] == target for index, target in enumerate(targets))
    # AND all the masks together, then select the rows where every condition holds
    filtered_df = df.loc[functools.reduce(operator.and_, filtered_dfs)]
    
    print(filtered_df)
    

    We construct each filter on the df in filtered_dfs by doing df[f"xyz.{index + 1}"] == target. For the first iteration through targets, this will be df["xyz.1"] == 2. For the second iteration, it will be df["xyz.2"] == 5.

    We then combine all these filters with functools.reduce(operator.and_, filtered_dfs), which is like doing filtered_df_1 & filtered_df_2 & ....

    We finally apply the filter to the dataframe through df.loc, which gives the rows that have a 2 in xyz.1 and a 5 in xyz.2. The output is:

       abc  xyz.1  xyz.2
    1    2      2      5
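
When the relevant column names and their criteria are already paired up, as in the question's relevant_col_names / filter_criteria lists, a vectorized alternative is to compare the whole sub-frame at once and then AND across columns with .all(axis=1). This is a sketch of that variant (not part of the answer above), using the same example data:

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    "abc": (1, 2, 3),
    "xyz.1": (4, 2, 7),
    "xyz.2": (8, 5, 5)
})
relevant_col_names = ["xyz.1", "xyz.2"]
filter_criteria = [2, 5]

# Compare the relevant sub-frame against the criteria (one value per column),
# then keep only the rows where every comparison is True.
mask = (df[relevant_col_names] == filter_criteria).all(axis=1)
filtered_df = df.loc[mask]
print(filtered_df)
```

This relies on pandas broadcasting the list of criteria across the columns of df[relevant_col_names], so the list must have one entry per relevant column, in the same order. It selects the same row as the reduce-based version.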