I have a pandas DataFrame as follows:
id | all_items | items_check1 | items_check2 |
---|---|---|---|
1239 | 'foobar,foo,foofoo,bar' | 'foo,bar' | 'foobar' |
3298 | 'foobar,foo' | 'foobar' | 'bar' |
9384 | 'foo,bar' | 'bar,foo' | 'bar' |
I want to check whether the items in `items_check1` are present in `all_items`, and save the result into a separate column `check1_output`; then repeat the same process with `items_check2` and `all_items`, saving the result into `check2_output`.
So the desired output should look like this:
id | all_items | items_check1 | items_check2 | check1_output | check2_output |
---|---|---|---|---|---|
1239 | 'foobar,foo,foofoo,bar' | 'foo,bar' | 'foobar' | True | True |
3298 | 'foobar,foo' | 'foobar' | 'bar' | True | False |
9384 | 'foo,bar' | 'bar,foo' | 'bar' | True | True |
I have billions of rows, and a single cell of `all_items` can contain up to 100 items. I'm looking for the most efficient way to do this comparison.
Attempts so far
The following is what I have tried. It was more efficient than iterating over rows, but I soon noticed that the output is not always as expected. What could be the reason for this behaviour?
df['check1_output'] = np.where([x[0] in x[1] for x in zip(df['items_check1'], df['all_items'])], True, False)
df['check2_output'] = np.where([x[0] in x[1] for x in zip(df['items_check2'], df['all_items'])], True, False)
Actual output:
id | all_items | items_check1 | items_check2 | check1_output | check2_output |
---|---|---|---|---|---|
1239 | 'foobar,foo,foofoo,bar' | 'foo,bar' | 'foobar' | True | True |
3298 | 'foobar,foo' | 'foobar' | 'bar' | True | True |
9384 | 'foo,bar' | 'bar,foo' | 'bar' | False | True |
Here's the snippet to reproduce the above dataframe:
df = pd.DataFrame({'id': [1239, 3298, 9384],
                   'all_items': ['foobar,foo,foofoo,bar', 'foobar,foo', 'foo,bar'],
                   'items_check1': ['foo,bar', 'foobar', 'bar,foo'],
                   'items_check2': ['foobar', 'bar', 'bar']})
EDIT: adding computation time
The approach above takes about 610 µs on a 3-row dataframe, but when I run it across the actual data with billions of records it takes many hours. Hence I'm looking for a much more efficient approach.
Try using `issubset()` with `str.split()`:
df["check1_output"] = df.apply(lambda x: set(x["items_check1"].split(",")).issubset(x["all_items"].split(",")), axis=1)
df["check2_output"] = df.apply(lambda x: set(x["items_check2"].split(",")).issubset(x["all_items"].split(",")), axis=1)
>>> df
id all_items ... check1_output check2_output
0 1239 foobar,foo,foofoo,bar ... True True
1 3298 foobar,foo ... True False
2  9384                foo,bar  ...           True          True
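If `df.apply` with `axis=1` proves too slow at this scale, a plain list comprehension over Python sets avoids the per-row lambda dispatch and splits `all_items` only once, reusing the sets for every check column. A sketch of that idea (`<=` is the subset operator on sets):

```python
import pandas as pd

df = pd.DataFrame({'id': [1239, 3298, 9384],
                   'all_items': ['foobar,foo,foofoo,bar', 'foobar,foo', 'foo,bar'],
                   'items_check1': ['foo,bar', 'foobar', 'bar,foo'],
                   'items_check2': ['foobar', 'bar', 'bar']})

# Split all_items into sets once, up front.
all_sets = [set(s.split(',')) for s in df['all_items']]

# Subset test per row, reusing the precomputed sets for both columns.
df['check1_output'] = [set(c.split(',')) <= a
                       for c, a in zip(df['items_check1'], all_sets)]
df['check2_output'] = [set(c.split(',')) <= a
                       for c, a in zip(df['items_check2'], all_sets)]
```

Whether this beats `apply` on your data is worth profiling, but with many check columns the one-time split of `all_items` should pay off.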