Tags: pandas, string, nlp, similarity

Finding similar phrases


How can I find similar phrases within a large list of phrases (e.g. tweets or movie reviews)?

For example, 'I like chocolate' is similar to 'I like chocolate bar' and 'I like mango', just as 'I ate apple' is similar to 'I ate apples'.

import pandas as pd

data = {'Text': ['I like chocolate',
                 'I like chocolate bar',
                 'I ate apple',
                 'I ate apples',
                 'I like mango',
                 'I can swim']}

df = pd.DataFrame(data, columns=['Text'])

Solution

  • From the fuzzywuzzy package, use extractWithoutOrder, the unsorted version of extract, to score every string against every other string. Each result is a tuple with the similarity score (0 to 100) at index 1:

    # pip install fuzzywuzzy
    # conda install -c conda-forge fuzzywuzzy 
    from fuzzywuzzy.process import extractWithoutOrder as extract
    from operator import itemgetter
    
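    # For each phrase, score it against every phrase in the column and keep
    # only the score (index 1 of each result tuple), in the original order.
    ratio = df["Text"].apply(lambda s: list(map(itemgetter(1), extract(s, df["Text"]))))
    # Arrange the per-phrase score lists as a square matrix indexed by row label.
    out = pd.DataFrame(ratio.tolist(), index=df.index, columns=df.index)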
    
    >>> out
         0    1    2    3    4    5
    0  100   95   44   43   64   86
    1   95  100   86   86   86   86
    2   44   86  100   96   49   38
    3   43   86   96  100   48   45
    4   64   86   49   48  100   36
    5   86   86   38   45   36  100
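
  • To go from a score matrix like the one above to an actual list of similar phrases, one option is to keep only the pairs whose score clears a threshold. The sketch below is an illustration, not part of the original answer: it swaps fuzzywuzzy for the standard library's difflib.SequenceMatcher so that it runs without extra installs, which means its scores will differ from the table above. The helper names similarity and similar_pairs and the threshold of 80 are assumptions chosen for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

texts = ['I like chocolate',
         'I like chocolate bar',
         'I ate apple',
         'I ate apples',
         'I like mango',
         'I can swim']


def similarity(a, b):
    """Return a 0-100 score from difflib's ratio (not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def similar_pairs(phrases, threshold=80):
    """Return (i, j, score) for every pair i < j scoring at least threshold."""
    pairs = []
    # combinations() visits each unordered pair once, so the symmetric
    # half of the matrix and the diagonal are never computed.
    for (i, a), (j, b) in combinations(enumerate(phrases), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


for i, j, score in similar_pairs(texts):
    print(f"{texts[i]!r} ~ {texts[j]!r}: {score}")
```

    Comparing every pair is quadratic in the number of phrases, so for a genuinely large list the threshold and the choice of scorer will likely need tuning, and some form of pre-grouping may be needed before pairwise scoring becomes practical.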