Suppose I have the following df
:
df = pd.DataFrame({
'col1':['x1','x2','x3'],
'col2':['y1','y2','y3'],
'col3':['z1','z2','z3'],
'col4':['a1','b2','c3']
})
and a list of elements:
l = ['x1','x2','y3']
I want to search elements of l
in all the columns of my df
, as it stands from my list x1
and x2
appear in col1
and y3
is in col2
, so I did:
df.loc[df['col1'].apply(lambda x: True if any(i in x for i in l) else False)|
df['col2'].apply(lambda x: True if any(i in x for i in l) else False)]
which gives me
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3
as desired but the above method needs me to make a |
operator for each column. So I wonder how can I do this iteration over all columns efficiently without using |
for every column?
A much, much more efficient way of doing this would be to use numpy broadcasting.
row_mask = (df.to_numpy() == l[:, None, None]).sum(axis=0).any(axis=1)
filtered = df[row_mask]
Output:
>>> filtered
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3