I am dealing with pandas dataframes that have some uniform labelling schema for the columns, but an arbitrary number of columns.
For example, take the following df with columns col_names, a special subset of those columns, and filter criteria corresponding to the special columns. They're linked by index, or with a dictionary; take your pick.
col_names = ['col1','col2','col3','xyz.1','xyz.2','xyz.3']
relevant_col_names = ['xyz.1', 'xyz.2', 'xyz.3']
filter_criteria = [something1, something2, something3]
I want to get the subset dataframe where 'xyz.1' == something1 & 'xyz.2' == something2 & 'xyz.3' == something3
I would normally do this by:
whatIwant = df.loc[(df['xyz.1']==something1) & (df['xyz.2']==something2) & (df['xyz.3']==something3)]
The issue is I need to be able to write that expression for an arbitrary number of 'xyz' columns without having to manually code the above expression.
For example, if the df happens to have 5 relevant columns like this:
['col1','col2','col3','xyz.1','xyz.2','xyz.3','xyz.4','xyz.5']
Then I would need to automatically write something like this:
whatIwant = df.loc[(df['xyz.1']==something1) & (df['xyz.2']==something2) & (df['xyz.3']==something3) & (df['xyz.4']==something4) & (df['xyz.5']==something5)]
Is there a way to write a boolean expression of arbitrary length based on a list or a dictionary (or something else), where I'm ANDing everything inside of it?
I don't know how to phrase this question for Google. It seems related to list and dictionary comprehensions, expression synthesis, or something else. Even knowing the right way to phrase the question, how to tag it for Stack Overflow, and/or what I should google would be very helpful. Thanks!
Assuming you have all the "somethings" in a list, you can do this:
import functools
import operator

import pandas as pd

df = pd.DataFrame.from_dict({
    "abc": (1, 2, 3),
    "xyz.1": (4, 2, 7),
    "xyz.2": (8, 5, 5),
})

targets = [2, 5]

# One boolean mask per "xyz.N" column: df["xyz.1"] == 2, df["xyz.2"] == 5, ...
filtered_dfs = (df[f"xyz.{index + 1}"] == target for index, target in enumerate(targets))

# AND all the masks together, then select the matching rows.
filtered_df = df.loc[functools.reduce(operator.and_, filtered_dfs)]
print(filtered_df)
We construct each filter on the df in filtered_dfs by doing df[f"xyz.{index + 1}"] == target. For the first iteration through targets, this will be df["xyz.1"] == 2; for the second iteration, it will be df["xyz.2"] == 5. We then combine all these filters with functools.reduce(operator.and_, filtered_dfs), which is like doing filtered_df_1 & filtered_df_2 & ....
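As a minimal standalone illustration of that reduce step (on plain booleans rather than pandas Series):

```python
import functools
import operator

# reduce chains the binary AND left to right across the sequence:
# (True & True) & False -> False
result = functools.reduce(operator.and_, [True, True, False])
print(result)  # False
```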
Finally, we apply the combined filter to the dataframe through df.loc, which gives the rows that have a 2 in xyz.1 and a 5 in xyz.2. Output is:
abc xyz.1 xyz.2
1 2 2 5
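A vectorized alternative, sketched under the assumption that the relevant column names and criteria are kept in parallel lists as in the question: compare all the relevant columns at once, then keep the rows where every comparison holds.

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    "abc": (1, 2, 3),
    "xyz.1": (4, 2, 7),
    "xyz.2": (8, 5, 5),
})

relevant_col_names = ["xyz.1", "xyz.2"]
filter_criteria = [2, 5]

# df[relevant_col_names] == filter_criteria broadcasts the list across the
# selected columns; .all(axis=1) keeps only rows where every comparison is True.
mask = (df[relevant_col_names] == filter_criteria).all(axis=1)
print(df.loc[mask])
```

This avoids building the masks one by one; it works because pandas broadcasts the list across the columns of the selected sub-frame, so the two lists only need to line up by position.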