Finding similar phrases

How can I find similar phrases within a large list of phrases (e.g. tweets or movie reviews)?

For example, 'I like chocolate' is similar to 'I like chocolate bar' and 'I like mango'; likewise, 'I ate apple' is similar to 'I ate apples'.

import pandas as pd

data = {'Text':  ['I like chocolate',
                  'I like chocolate bar',
                  'I ate apple',
                  'I ate apples',
                  'I like mango',
                  'I can swim']  
        }

df = pd.DataFrame(data, columns=['Text'])

Solution

From the fuzzywuzzy package, use extractWithoutOrder (the unsorted version of extract) to score each string against every other string. Scores range from 0 to 100, and higher means more similar:

# pip install fuzzywuzzy
# conda install -c conda-forge fuzzywuzzy 
from fuzzywuzzy.process import extractWithoutOrder as extract
from operator import itemgetter

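# For each phrase, score it against every phrase in the column.
# extract() yields (choice, score, key) tuples; itemgetter(1) keeps only the score.
ratio = df["Text"].apply(lambda s: list(map(itemgetter(1), extract(s, df["Text"]))))

# Build a square matrix: rows and columns are both the original row labels.
out = pd.DataFrame(ratio.tolist(), index=df.index, columns=df.index)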

>>> out
     0    1    2    3    4    5
0  100   95   44   43   64   86
1   95  100   86   86   86   86
2   44   86  100   96   49   38
3   43   86   96  100   48   45
4   64   86   49   48  100   36
5   86   86   38   45   36  100
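
To go from the matrix to an actual list of similar phrases, one option is to keep only the pairs whose score clears a cutoff. The sketch below is a minimal illustration, not part of fuzzywuzzy: the helper name similar_pairs and the threshold of 90 are assumptions chosen for this example, and the scores are copied from the output above (out.values.tolist() would give the same nested list). It uses only the standard library so it can be run without pandas.

```python
from itertools import combinations

texts = [
    "I like chocolate",
    "I like chocolate bar",
    "I ate apple",
    "I ate apples",
    "I like mango",
    "I can swim",
]

# Scores copied from the `out` matrix shown above.
scores = [
    [100, 95, 44, 43, 64, 86],
    [95, 100, 86, 86, 86, 86],
    [44, 86, 100, 96, 49, 38],
    [43, 86, 96, 100, 48, 45],
    [64, 86, 49, 48, 100, 36],
    [86, 86, 38, 45, 36, 100],
]


def similar_pairs(texts, scores, threshold=90):
    """Hypothetical helper: return (text_a, text_b, score) for pairs scoring >= threshold.

    Only the upper triangle (i < j) is visited, so each pair is reported once
    and the diagonal of self-matches is skipped.
    """
    pairs = [
        (texts[i], texts[j], scores[i][j])
        for i, j in combinations(range(len(texts)), 2)
        if scores[i][j] >= threshold
    ]
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


pairs = similar_pairs(texts, scores, threshold=90)
for a, b, score in pairs:
    print(f"{score:3d}  {a!r} ~ {b!r}")
```

With a cutoff of 90 this keeps 'I ate apple' / 'I ate apples' (96) and 'I like chocolate' / 'I like chocolate bar' (95). The right cutoff depends on the data: at 86, for instance, 'I can swim' would also be paired with 'I like chocolate', so it is worth inspecting a sample of matches before settling on a value.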