How can I find similar phrases within a large list of phrases (e.g., tweets or movie reviews)?
For example, 'I like chocolate' is similar to 'I like chocolate bar' and 'I like mango'; likewise, 'I ate apple' is similar to 'I ate apples'.
import pandas as pd

data = {'Text': ['I like chocolate',
                 'I like chocolate bar',
                 'I ate apple',
                 'I ate apples',
                 'I like mango',
                 'I can swim']}
df = pd.DataFrame(data, columns=['Text'])
Use extractWithoutOrder from the fuzzywuzzy package (the unsorted version of extract) to compute the similarity between every pair of strings:
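# pip install fuzzywuzzy
# conda install -c conda-forge fuzzywuzzy
from fuzzywuzzy.process import extractWithoutOrder as extract
from operator import itemgetter

# For each phrase, score it against every phrase in the column.
# Each result is a tuple whose second item is the score, so itemgetter(1) keeps only the score.
ratio = df["Text"].apply(lambda s: list(map(itemgetter(1), extract(s, df["Text"]))))

# Build a square matrix: row i, column j holds the score of phrase i against phrase j.
out = pd.DataFrame(ratio.tolist(), index=df.index, columns=df.index)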
>>> out
     0    1    2    3    4    5
0  100   95   44   43   64   86
1   95  100   86   86   86   86
2   44   86  100   96   49   38
3   43   86   96  100   48   45
4   64   86   49   48  100   36
5   86   86   38   45   36  100
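A score matrix like this still has to be turned into "which phrases are similar", usually by keeping the pairs above some cutoff. The sketch below illustrates that step with only the standard library: it swaps in difflib.SequenceMatcher for the fuzzywuzzy scorer, so its scores will not match the table above. The similarity helper and the THRESHOLD value are hypothetical names chosen for this illustration, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher
from itertools import combinations

phrases = [
    "I like chocolate",
    "I like chocolate bar",
    "I ate apple",
    "I ate apples",
    "I like mango",
    "I can swim",
]


def similarity(a: str, b: str) -> int:
    """Return a 0-100 score (hypothetical helper; a stdlib stand-in, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


THRESHOLD = 80  # assumed cutoff; tune it for your own data

# Compare each unordered pair once and keep those at or above the cutoff.
similar_pairs = []
for a, b in combinations(phrases, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        similar_pairs.append((a, b, score))

for a, b, score in similar_pairs:
    print(f"{score:3d}  {a!r} ~ {b!r}")
```

With these six phrases and a cutoff of 80, only the two chocolate phrases and the two apple phrases are reported as similar. A lower cutoff would admit looser matches such as 'I like mango'. Note that comparing every pair grows quadratically with the number of phrases, so a very large list may need a pre-filtering step first.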