I have two big data frames (1000s of rows), and I need to match them by substring, for example:
df1:
Id Title
1 The house of pump
2 Where is Andijan
3 The Joker
4 Good bars in Andijan
5 What a beautiful house
df2:
Keyword
house
andijan
joker
And the expected output is:
Id Title Keyword
1 The house of pump house
2 Where is Andijan andijan
3 The Joker joker
4 Good bars in Andijan andijan
5 What a beautiful house house
Now, I've sort of written a very non-efficient way to match it, but for the actual size of the data frames it runs for an extremely long time:
for keyword in df2.to_dict(orient='records'):
df1['keyword'] = np.where(creative_df['title'].str.contains(keyword['keyword']), keyword['keyword'], df1['keyword'])
Now, I'm sure there's a more pandas-friendy and efficient way to do the same, and also have it run in a reasonable time.
Let us try findall
import re
df1['new'] = df1.Title.str.findall('|'.join(df2.Keyword.tolist()),flags= re.IGNORECASE).str[0]
df1
Id Title new
0 1 The house of pump house
1 2 Where is Andijan Andijan
2 3 The Joker Joker
3 4 Good bars in Andijan Andijan
4 5 What a beautiful house house