Search code examples
python-3.xpandasdataframenlpkeyword-search

List of words matched with text column in dataframe


I have 2 dataframes, first with the column of text data (more than 10k rows) and second with keywords (almost 100 list)

DataFrame 1:

           Text
a white house cat plays in garden
cat is a domestic species of small carnivorous mammal
cat is walking in garden behind white house
yellow banana is healthy

DataFrame 2:

ID     Keywords
1    ['cat','white']
2    ['garden','white','cat']
3    ['domestic','mammal']

I want to add column in datafram 1 with ID where maximum number of words matched with dataframe 2. Also, if there is tie between more than 1 or 2 ID's then concatinate both ID's together. There are some cases as well where None of the words matched therefore, Add 'No Match' in that case.

Output:

           Text                                                ID
a white house cat plays in garden                              2
cat is a domestic species of small carnivorous mammal          3
cat is walking in behind white house                           1,2
yellow banana is healthy                                       'No Match'

Solution

  • This will work. It creates a list of the number of matches per list of keywords, then looks up the ID of the max values in that list.

    import pandas as pd
    import ast
    
    df1 = pd.DataFrame(['a white house cat plays in garden', 'cat is a domestic species of small carnivorous mammal', 'cat is walking in behind white house', 'yellow banana is healthy'], columns=['Text'])
    df2 = pd.DataFrame([ { "ID": 1, "Keywords": "['cat','white']" }, { "ID": 2, "Keywords": "['garden','white','cat']" }, { "ID": 3, "Keywords": "['domestic','mammal']" } ])
    df2['Keywords'] = df2['Keywords'].apply(ast.literal_eval)
    
    def get_ids(text):
        matches = [len(set(text.split(" ")) & set(i)) for i in df2['Keywords']]
        matches_ids = [df2['ID'][index] for index, val in enumerate(matches) if val == max(matches) if max(matches)>0 ]
        return ", ".join(str(x) for x in matches_ids) if matches_ids else "No Match"
        
    df1['ID'] = df1['Text'].apply(get_ids)
    

    Result:

    Text ID
    0 a white house cat plays in garden 2
    1 cat is a domestic species of small carnivorous mammal 3
    2 cat is walking in behind white house 1, 2
    3 yellow banana is healthy No Match