How can I cleanly make a subset of my DataFrame, where my desired columns meet certain criteria?
Take the following DataFrame
as an example:
df = pd.DataFrame(data=[[0, 1, 2],
[2, 3, 4],
[4, 5, 6],
[6, 7, 8]],
columns= ["a", "b", "c"])
I have a boolean array that indicates which columns from my DataFrame
that I would like to apply a filter:
bool_arr = np.array([True, False, True])
I also have an array that holds the threshold values for each column. For each column where the bool_arr
holds True
, The row value needs to be below the threshold value:
thresh_arr = np.array([3, 2, 7])
The DataFrame subset that should result from my filtering/masking is the following:
"a" "b" "c"
0 0 1 2
1 2 3 4
I was able to get this DataFrame subset with the following code, but it is not as clean as I would like:
sel_cols = df.columns[bool_arr]
sel_thresh = thresh_arr[bool_arr]
df_masked = df[sel_cols] < sel_thresh
sel_rows = df.index[df_masked.sum(axis=1) == bool_arr.sum()]
df_subset = df.loc[sel_rows]
Anyone have any ideas on how to make this shorter and cleaner?
Here is another way to do it with Pandas Index.isin, pipe, and lt:
df_subset = df[
df.index.isin(
df.loc[:, bool_arr]
.pipe(lambda df_: df_[df_.lt(thresh_arr[bool_arr])])
.dropna()
.index
)
]
print(df_subset)
# Output
a b c
0 0 1 2
1 2 3 4