Search code examples
pandasdataframearray-broadcasting

Iterating over all columns of dataframe to find list of strings


Suppose I have the following df:

df = pd.DataFrame({
    'col1':['x1','x2','x3'],
    'col2':['y1','y2','y3'],
    'col3':['z1','z2','z3'],
    'col4':['a1','b2','c3']
})

and a list of elements:

l = ['x1','x2','y3']

I want to search elements of l in all the columns of my df, as it stands from my list x1 and x2 appear in col1 and y3 is in col2, so I did:

df.loc[df['col1'].apply(lambda x: True if any(i in x for i in l) else False)|
       df['col2'].apply(lambda x: True if any(i in x for i in l) else False)]

which gives me

  col1 col2 col3 col4
0   x1   y1   z1   a1
1   x2   y2   z2   b2
2   x3   y3   z3   c3

as desired but the above method needs me to make a | operator for each column. So I wonder how can I do this iteration over all columns efficiently without using | for every column?


Solution

  • A much, much more efficient way of doing this would be to use numpy broadcasting.

    row_mask = (df.to_numpy() == l[:, None, None]).sum(axis=0).any(axis=1)
    filtered = df[row_mask]
    

    Output:

    >>> filtered
      col1 col2 col3 col4
    0   x1   y1   z1   a1
    1   x2   y2   z2   b2
    2   x3   y3   z3   c3