Search code examples
pythonpandasmatchsubstr

Verify whether substring of column A matches with substring in column B in Pandas


I have the following dataframe with news text (column Text) and the query (column Query) that I run to extract that specific news text from an API. Not all news texts are relevant to the query so I am trying to create a new column that returns True if at last part of the Query string is within the Text column. I know how to check whether the whole Query string is in the text, but as you see, often the official company name e.g. Nike Ltd. is not how news text report it (they may use more informal abbreviations).

It is a sort of fuzzy matching and I have not really figure out how to solve it.

news = {'Text':['Nike invests in shoes', 'Adidas invests in t-shirts', 'dog drank water before eating the meal'], 'Source':['NYT', 'WP', 'Guardian'], 'Query' :['Nike Ltd.', "LV", 'dog central']}
news_df = pd.DataFrame(news)

Desired Outcome:

desired = {'Text':['Nike invests in shoes', 'Adidas invests in t-shirts', 'dog drank water before eating the meal'], 'Source':['NYT', 'WP', 'Guardian'], 'Query' :['Nike Ltd.', "LV", 'dog central'], 'Outcome':[True, False, True]}
desired _df = pd.DataFrame(desired )

any suggestions?


Solution

  • You wrote part of the Query string is within the Text column, but we shoud set some limits to this statement.

    E.g. a single letter is also a part of the string, but we should not look for such cases.

    You should also treat "better" cases when e.g. 3 words from Query have been matched in Text than a match for a single word.

    Taking the above into account, I think, a more reasonable approach is to:

    • split Query into words,
    • count how many of these words have matches in Text (multiple occurrences should be still counted as one) - this is the result for the current row,
    • apply the above procedure to each row in news_df.

    The code to do it can be as follows:

    import re
    
    def hasMatch(t1, t2):
        cnt = 0
        for wrd in filter(lambda tt: len(tt) > 0, re.split(r'[^a-z]', t2, re.I)):
            if re.search(wrd, t1):
                cnt += 1
        return cnt
    
    desired_df = news_df.copy()
    desired_df['Outcome'] = desired_df.apply(lambda row: hasMatch(row.Text, row.Query), axis=1)
    

    In this case the Outcome column is of int type (not bool), but I think it is beter than bool.

    Or if for you it is enough to find any match (even a single word) and you don't care about the number of such matches, change the above function so that it returns bool values instead of int.