python pandas cosine-similarity fuzzywuzzy sentence-similarity

Sentence comparison: how to highlight differences

I have the following sequences of strings within a column in pandas:

SEQ
An empty world
So the word is
So word is
No word is

I can check the similarity using fuzzywuzzy or cosine distance. However, I would like to know how to get information about which word changes from one row to another. For example, the similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3: they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and likewise for the 3rd and 4th rows. How can I see the changes between two rows/texts?

Solution

Assuming you're using Jupyter / IPython and are only interested in comparing each row with the one preceding it, I would do something like this.

The general concept is:

find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.

import pandas as pd

data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']

df = pd.DataFrame(data, columns=['phrase'])

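def bold(tok):
    # Wrap a token in HTML bold tags.
    return f'<b>{tok}</b>'

def highlight_shared(string1, string2, format_func):
    # Tokens present in both strings; membership only, position is ignored.
    toks = string1.split(' ')
    shared_toks = set(toks) & set(string2.split(' '))
    return ' '.join(format_func(tok) if tok in shared_toks else tok for tok in toks)

# Quick check on two sample sentences.
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
# '<b>the</b> <b>cat</b> sat on <b>the</b> mat'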

df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

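# IPython.display is the public import path; importing HTML from IPython.core.display is deprecated.
from IPython.display import HTML

# escape=False keeps the <b> tags as markup instead of escaping them.
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))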
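
The approach above marks the tokens two rows have in common, but it ignores word order and does not say which word was dropped or swapped. If that is what you need, one possible alternative (not part of the code above) is to diff the token lists with difflib.SequenceMatcher from the standard library. This is a minimal sketch; the helper name token_diff is made up for illustration.

```python
import difflib

def token_diff(previous, current):
    """Return (operation, old_tokens, new_tokens) for every span that changed.

    token_diff is a hypothetical helper name, not an existing library function.
    """
    old, new = previous.split(), current.split()
    # Compare lists of words rather than characters, so changes are word-level.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']

phrases = ['An empty world', 'So the word is', 'So word is', 'No word is']

# Compare each phrase with the one before it.
for previous, current in zip(phrases, phrases[1:]):
    print(f'{previous!r} -> {current!r}: {token_diff(previous, current)}')
```

For rows 2 and 3 this reports a single 'delete' of 'the', and for rows 3 and 4 a 'replace' of 'So' with 'No'. The same tuples could be fed into a formatting function like bold to colour deletions and replacements in the HTML table.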