Search code examples
pythonpandasperformanceoptimizationstring-matching

Match two data frames by substring in python


I have two big data frames (1000s of rows), and I need to match them by substring, for example:

df1:

Id    Title
1     The house of pump
2     Where is Andijan
3     The Joker
4     Good bars in Andijan
5     What a beautiful house

df2:

Keyword
house
andijan
joker

And the expected output is:

Id    Title                    Keyword
1     The house of pump        house
2     Where is Andijan         andijan
3     The Joker                joker
4     Good bars in Andijan     andijan
5     What a beautiful house   house

Now, I've sort of written a very non-efficient way to match it, but for the actual size of the data frames it runs for an extremely long time:

for keyword in df2.to_dict(orient='records'):
    df1['keyword'] = np.where(creative_df['title'].str.contains(keyword['keyword']), keyword['keyword'], df1['keyword'])

Now, I'm sure there's a more pandas-friendy and efficient way to do the same, and also have it run in a reasonable time.


Solution

  • Let us try findall

    import re
    df1['new'] = df1.Title.str.findall('|'.join(df2.Keyword.tolist()),flags= re.IGNORECASE).str[0]
    df1
       Id                   Title      new
    0   1       The house of pump    house
    1   2        Where is Andijan  Andijan
    2   3               The Joker    Joker
    3   4    Good bars in Andijan  Andijan
    4   5  What a beautiful house    house