Search code examples
python-3.xpandaslarge-data

How to efficiently find dataframe row that contains items from list?


Let's say I have following example

items = ['milk', 'bread', 'water']

df:
name     item1    item2    item3

items_1  milk     water
items_2  milk     rubber   juice
items_3  juice    paper    wood
items_4  bread
items_5  bread    water    milk
items_6  milk     juice

In this example I would like to get all df rows whose members are completely in items list, which means:

  • items_1
  • items_4
  • items_5

Now, the real "df" dataframe will contain several millions of rows, i.e. items_*, hence the "efficiently" in the title. The number of columns of "df" will be between 10 and 20. Also, there will be several thousands of "items" lists with between 10 and 20 elements.

Can someone please help me out on this one?


Solution

  • Use ~isin to check if the condition is NOT True for all the values, get the index, use boolean indexing. You get

    true_names = df[~df.iloc[:, 1:].isin(items)].isnull().all(1)
    df.loc[true_names, 'name']
    
    0    name_1
    3    name_4
    4    name_5