I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance. However, I would like to know how to get information about the word that changes position from one row to another. For example, the similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3: they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and likewise between the 3rd row and the 4th. How can I see the changes between two rows/texts?
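Assuming you're using Jupyter / IPython and are only interested in comparing each row with the one preceding it, I would do something like this.
The general concept is to split each string into tokens on spaces, find the tokens it shares with the previous row's string (a set intersection), and wrap those shared tokens in HTML formatting: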
import pandas as pd
data = ['An empty world',
'So the word is',
'So word is',
'No word is']
df = pd.DataFrame(data, columns=['phrase'])
bold = lambda x: f'<b>{x}</b>'
def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok for tok in string1.split(' ')])
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)
from IPython.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
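The approach above highlights what two rows share, but it ignores word order and does not say which word went missing. As a rough sketch of one way to surface that, the standard library's `difflib.SequenceMatcher` can be run over the token lists instead of over characters. The helper name `token_diff` and the `'removed'` / `'added'` labels are made up here for illustration; they are not part of pandas or difflib.

```python
import difflib

def token_diff(previous, current):
    """Describe how to turn `previous` into `current`, token by token.

    Hypothetical helper: returns a list of (label, tokens) pairs, where
    label is 'removed' (only in `previous`) or 'added' (only in `current`).
    """
    prev_toks = previous.split()
    curr_toks = current.split()
    # SequenceMatcher works on any sequence of hashable items, so lists of
    # words are fine; autojunk is disabled because the inputs are short.
    matcher = difflib.SequenceMatcher(a=prev_toks, b=curr_toks, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('delete', 'replace'):
            changes.append(('removed', prev_toks[i1:i2]))
        if tag in ('insert', 'replace'):
            changes.append(('added', curr_toks[j1:j2]))
    return changes

print(token_diff('So the word is', 'So word is'))
# [('removed', ['the'])]
print(token_diff('So word is', 'No word is'))
# [('removed', ['So']), ('added', ['No'])]
```

Because it takes the same two strings as `highlight_shared`, it could presumably be applied row by row in the same way, with `df.apply` over `phrase` and `previous_phrase`.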