Search code examples
pythonpandasdataframefiltermask

Get subset of DataFrame where certain columns meet certain criteria


How can I cleanly make a subset of my DataFrame, where my desired columns meet certain criteria?

Take the following DataFrame as an example:

df = pd.DataFrame(data=[[0, 1, 2], 
                        [2, 3, 4], 
                        [4, 5, 6], 
                        [6, 7, 8]], 
                  columns= ["a", "b", "c"])

I have a boolean array that indicates which columns from my DataFrame that I would like to apply a filter:

bool_arr = np.array([True, False, True])

I also have an array that holds the threshold values for each column. For each column where the bool_arr holds True, The row value needs to be below the threshold value:

thresh_arr = np.array([3, 2, 7])

The DataFrame subset that should result from my filtering/masking is the following:

  "a" "b" "c"
0  0   1   2
1  2   3   4

I was able to get this DataFrame subset with the following code, but it is not as clean as I would like:

sel_cols = df.columns[bool_arr]
sel_thresh = thresh_arr[bool_arr]
df_masked = df[sel_cols] < sel_thresh
sel_rows = df.index[df_masked.sum(axis=1) == bool_arr.sum()]
df_subset = df.loc[sel_rows]

Anyone have any ideas on how to make this shorter and cleaner?


Solution

  • Here is another way to do it with Pandas Index.isin, pipe, and lt:

    df_subset = df[
        df.index.isin(
            df.loc[:, bool_arr]
            .pipe(lambda df_: df_[df_.lt(thresh_arr[bool_arr])])
            .dropna()
            .index
        )
    ]
    
    print(df_subset)
    # Output
       a  b  c
    0  0  1  2
    1  2  3  4