This is a modified version of an earlier question of mine, which did not cover the whole problem: Pandas - check if a string column contains a pair of strings
I have two dataframes:
import pandas as pd
df1 = pd.DataFrame({'consumption': [
    'squirrel ate apple', 'monkey likes apple', 'monkey banana gets',
    'badger gets banana', 'giraffe eats grass', 'badger apple loves',
    'elephant is huge', 'elephant eats banana tree', 'squirrel digs in grass']})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})
The goal is to test, for each row of df1.consumption, whether it contains any of the df2.food:df2.creature pairs.
The expected answer for the example above would be:
['True', 'False', 'True', 'False', 'False', 'True', 'False', 'True', 'False']
The pattern is:
squirrel ate apple = True, since squirrel and apple are a pair. monkey likes apple = False, since monkey and apple are not a pair we are looking for.
I was thinking of building a dictionary of dataframes from the pair values, one dataframe per creature (e.g. squirrel, monkey, etc.), and then using np.where with str.contains to build a boolean expression.
I am not sure that is the easiest way.
Consider this vectorized approach:
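import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# word counts for every sentence in df1 (this also learns the vocabulary)
X = vect.fit_transform(df1.consumption)
# the same vocabulary applied to each "creature food" pair
Y = vect.transform(df2.creature + ' ' + df2.food)
# a product above 1 means both words of a pair occur in the sentence
# (assuming no word is repeated within a sentence)
res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))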
Result:
In [67]: res
Out[67]: array([ True, False, True, False, False, True, False, True, False], dtype=bool)
Explanation:
In [68]: pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
Out[68]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    1       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       0       0     0     0         0     0        0      0     0   0   0      1      0       1         0     0
2      0    0       0       1     0     0         0     1        0      0     0   0   0      0      0       1         0     0
3      0    0       1       1     0     0         0     1        0      0     0   0   0      0      0       0         0     0
4      0    0       0       0     0     1         0     0        1      1     0   0   0      0      0       0         0     0
5      1    0       1       0     0     0         0     0        0      0     0   0   0      0      1       0         0     0
6      0    0       0       0     0     0         1     0        0      0     1   0   1      0      0       0         0     0
7      0    0       0       1     0     1         1     0        0      0     0   0   0      0      0       0         0     1
8      0    0       0       0     1     0         0     0        0      1     0   1   0      0      0       0         1     0
In [69]: pd.DataFrame(Y.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
Out[69]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    0       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       1       0     0     0         0     0        0      0     0   0   0      0      0       0         0     0
2      0    0       0       1     0     0         0     0        0      0     0   0   0      0      0       1         0     0
3      0    0       0       1     0     0         1     0        0      0     0   0   0      0      0       0         0     0
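The step that ties the two tables together is the product X.dot(Y.T): entry (i, j) counts the vocabulary words shared by sentence i and pair j, so a value of 2 means both words of pair j occur in sentence i. A minimal sketch that rebuilds the example and prints that matrix (the name `hits` is introduced here for illustration only and is not part of the code above):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df1 = pd.DataFrame({'consumption': [
    'squirrel ate apple', 'monkey likes apple', 'monkey banana gets',
    'badger gets banana', 'giraffe eats grass', 'badger apple loves',
    'elephant is huge', 'elephant eats banana tree', 'squirrel digs in grass']})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

vect = CountVectorizer()
X = vect.fit_transform(df1.consumption)             # 9 sentences x 18 words
Y = vect.transform(df2.creature + ' ' + df2.food)   # 4 pairs x 18 words

# hits[i, j] = number of words that sentence i shares with pair j
hits = X.dot(Y.T).toarray()
print(hits)

# a sentence matches if it shares both words with at least one pair
res = (hits > 1).any(axis=1)
print(res)
```

Row 0 of `hits` is [2, 1, 0, 0]: "squirrel ate apple" shares both words with the squirrel/apple pair and only "apple" with the badger/apple pair, which is why the threshold is > 1 rather than > 0.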
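UPDATE: the result can be stored as a new column of df1. Row 9 below is not in the question's df1; it appears to have been added to show that words separated by punctuation are matched too.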
In [92]: df1['match'] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
In [93]: df1
Out[93]:
                 consumption  match
0         squirrel ate apple   True
1         monkey likes apple  False
2         monkey banana gets   True
3         badger gets banana  False
4         giraffe eats grass  False
5         badger apple loves   True
6           elephant is huge  False
7  elephant eats banana tree   True
8     squirrel digs in grass  False
9        squirrel.eats/apple   True  # <----- NOTE
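Two properties of the default CountVectorizer are worth keeping in mind here. First, its analyzer lowercases the text and keeps runs of two or more word characters, which is why squirrel.eats/apple is split into separate words. Second, it stores counts, so a word repeated within one sentence can push the product above 1 on its own; passing binary=True should avoid that. A small sketch of both points (the sentence, the pair, and the helper `match` are made up for illustration and do not come from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# '.' and '/' act as separators, just like spaces
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer('squirrel.eats/apple')
print(tokens)  # ['squirrel', 'eats', 'apple']

# hypothetical data: 'apple' appears twice, 'squirrel' not at all
sentences = ['monkey ate apple after apple']
pairs = ['squirrel apple']

def match(vect):
    """Apply the same > 1 test as above with the given vectorizer."""
    X = vect.fit_transform(sentences)
    Y = vect.transform(pairs)
    return (X.dot(Y.T) > 1).toarray().any(axis=1)

counted = match(CountVectorizer())             # 'apple' is counted twice
binary = match(CountVectorizer(binary=True))   # each word counts at most once
print(counted, binary)
```

With plain counts the sentence is reported as a match even though the creature is missing; with binary=True it is not. For the data in the question no word is repeated, so both settings give the same result.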