python pandas cosine-similarity fuzzywuzzy sentence-similarity

Sentence comparison: how to highlight differences


I have the following sequences of strings within a column in pandas:

SEQ
An empty world
So the word is
So word is
No word is

I can check the similarity using fuzzywuzzy or cosine distance. However, I would like to know how to get information about the word that changes position from one row to another. For example, the similarity between the first row and the second one is 0. But there is similarity between rows 2 and 3: they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible. The same goes for the 3rd and 4th rows. How can I see the changes between two rows/texts?


Solution

  • Assuming you're using Jupyter / IPython and are only interested in comparing each row with the one preceding it, I would do something like this.

    The general concept is:

    • find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
    • apply some html formatting to the tokens shared between the two strings.
    • apply this to all rows.
    • output the resulting dataframe as html and render it in ipython.
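
    import pandas as pd
    from IPython.display import HTML  # IPython.core.display is the older, deprecated import path

    data = ['An empty world',
            'So the word is',
            'So word is',
            'No word is']

    df = pd.DataFrame(data, columns=['phrase'])

    # Formatter applied to every shared token
    bold = lambda x: f'<b>{x}</b>'

    def highlight_shared(string1, string2, format_func):
        """Apply format_func to each token of string1 that also occurs in string2."""
        shared_toks = set(string1.split(' ')) & set(string2.split(' '))
        return ' '.join(format_func(tok) if tok in shared_toks else tok
                        for tok in string1.split(' '))

    # Quick check on a standalone pair of strings
    highlight_shared('the cat sat on the mat', 'the cat is fat', bold)

    # Pair each phrase with the one in the preceding row (empty string for the first row)
    df['previous_phrase'] = df.phrase.shift(1, fill_value='')
    df['tokens_shared_with_previous'] = df.apply(
        lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

    # escape=False keeps the <b> tags as markup instead of escaping them
    HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))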
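
The snippet above marks the tokens two rows have in common and ignores word order. If the goal is to see which word went missing or was swapped, one possible alternative is a token-level diff with `difflib.SequenceMatcher` from the standard library. The following is only a minimal sketch: `mark_changes` is a hypothetical helper name, and the choice of `<del>` for removed tokens and `<b>` for added ones is an assumption, not part of the original answer.

```python
import difflib

def mark_changes(previous, current):
    """Return `current` as HTML, with tokens missing from `previous`
    wrapped in <del> and newly added tokens wrapped in <b>."""
    prev_toks, curr_toks = previous.split(), current.split()
    # Compare lists of words rather than characters
    matcher = difflib.SequenceMatcher(a=prev_toks, b=curr_toks, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.extend(curr_toks[j1:j2])
        if tag in ('delete', 'replace'):
            # Tokens present in the previous row but not in the current one
            out.extend(f'<del>{tok}</del>' for tok in prev_toks[i1:i2])
        if tag in ('insert', 'replace'):
            # Tokens present in the current row but not in the previous one
            out.extend(f'<b>{tok}</b>' for tok in curr_toks[j1:j2])
    return ' '.join(out)

print(mark_changes('So the word is', 'So word is'))
print(mark_changes('So word is', 'No word is'))
```

To apply it to the dataframe, something like `df['changes'] = [mark_changes(p, c) for p, c in zip(df.phrase.shift(1, fill_value=''), df.phrase)]` should work, and the result can be rendered with `to_html(escape=False)` in the same way as before. Because the comparison is sequence-based, a word that merely moves position shows up as one deletion plus one insertion.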