Tags: python, pandas, dataframe, comparison

Most efficient way of checking whether a string is present in another column's values in Pandas


I have a pandas dataframe as follows,

id    all_items                 items_check1  items_check2
1239  'foobar,foo,foofoo,bar'  'foo,bar'     'foobar'
3298  'foobar,foo'              'foobar'      'bar'
9384  'foo,bar'                 'bar,foo'     'bar'

I want to check whether the items in items_check1 are present in all_items and save the result in a separate column, check1_output; then I want to repeat the same process with items_check2 and all_items, saving the result in check2_output.

So the desired output should look like this:

id    all_items                 items_check1  items_check2  check1_output  check2_output
1239  'foobar,foo,foofoo,bar'  'foo,bar'     'foobar'      True           True
3298  'foobar,foo'              'foobar'      'bar'         True           False
9384  'foo,bar'                 'bar,foo'     'bar'         True           True

I have billions of rows, and a single all_items cell can sometimes contain up to 100 items. I'm looking for the most efficient way to do this comparison.

Attempts so far
Following is what I have tried. It was more efficient than iterating over the rows, but I soon noticed that the output is not always as expected. What could be the reason for this behaviour?

df['check1_output'] = np.where([x[0] in x[1] for x in zip(df['items_check1'], df['all_items'])], True, False)
df['check2_output'] = np.where([x[0] in x[1] for x in zip(df['items_check2'], df['all_items'])], True, False)

Actual output:

id    all_items                 items_check1  items_check2  check1_output  check2_output
1239  'foobar,foo,foofoo,bar'  'foo,bar'     'foobar'      True           True
3298  'foobar,foo'              'foobar'      'bar'         True           True
9384  'foo,bar'                 'bar,foo'     'bar'         False          True

Here's the snippet to reproduce the above dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1239,3298,9384],
                   'all_items': ['foobar,foo,foofoo,bar','foobar,foo','foo,bar'],
                   'items_check1': ['foo,bar','foobar','bar,foo'],
                   'items_check2': ['foobar','bar','bar']
                  })

EDIT: adding computation time

The approach I mentioned takes 610 µs for a 3-row dataframe, but when I run it across the actual data with billions of records it takes many hours. Hence I'm looking for a much more efficient approach.
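
For reference, a minimal sketch of one way such a micro-benchmark can be taken with the standard timeit module (this assumes the 610 µs figure was measured on the 3-row example frame; the function below is simply the substring-based attempt from above):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1239, 3298, 9384],
                   'all_items': ['foobar,foo,foofoo,bar', 'foobar,foo', 'foo,bar'],
                   'items_check1': ['foo,bar', 'foobar', 'bar,foo'],
                   'items_check2': ['foobar', 'bar', 'bar']})

def substring_check(frame):
    # The substring-based attempt from "Attempts so far".
    frame['check1_output'] = np.where([a in b for a, b in zip(frame['items_check1'], frame['all_items'])], True, False)
    frame['check2_output'] = np.where([a in b for a, b in zip(frame['items_check2'], frame['all_items'])], True, False)

# Average seconds per run over 1,000 repetitions of the 3-row check.
print(timeit.timeit(lambda: substring_check(df), number=1000) / 1000)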


Solution

  • Try using issubset() with str.split():

    df["check1_output"] = df.apply(lambda x: set(x["items_check1"].split(",")).issubset(x["all_items"].split(",")), axis=1)
    df["check2_output"] = df.apply(lambda x: set(x["items_check2"].split(",")).issubset(x["all_items"].split(",")), axis=1)
    >>> df
         id              all_items  ... check1_output check2_output
    0  1239  foobar,foo,foofoo,bar  ...          True          True
    1  3298             foobar,foo  ...          True         False
    2  9384                foo,bar  ...          True          True
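
  • Why the attempt in the question misbehaves (answering the "what could be the reason" part): x[0] in x[1] on two strings is a plain substring test on the raw comma-separated text, not an item-by-item comparison. It gives false positives when a check item happens to occur inside a longer item ('bar' inside 'foobar') and false negatives when all items are present but not as one contiguous substring. Splitting on the comma and comparing sets, as above, avoids both problems. A quick illustration:

    # Substring test on the raw strings (the approach from the question)
    'bar' in 'foobar,foo'      # True  -- 'bar' matches inside 'foobar'
    'bar,foo' in 'foo,bar'     # False -- both items present, just not in this order

    # Set comparison on the split items (the approach shown above)
    set('bar'.split(',')).issubset('foobar,foo'.split(','))   # False
    set('bar,foo'.split(',')).issubset('foo,bar'.split(','))  # True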