Check if there is a similar string in the same column

I have a data frame like this,

df
col1             col2
 A        'the value is zero'
 B        'this is a cat'
 C        'the value is one'
 D        'nothing is here'
 E        'the colour is blue'
 F        'this is dog'
 G        'empty sequence'
 H        'the colour is red'
 I        'the colour is green'

Now I want to flag each string that has a similar string elsewhere in the column with 1, and all others with 0, so the final data frame should look like this:

col1             col2                 col1
 A        'the value is zero'           1
 B        'this is a cat'               1
 C        'the value is one'            1
 D        'nothing is here'             0
 E        'the colour is blue'          1
 F        'this is dog'                 1
 G        'empty sequence'              0
 H        'the colour is red'           1
 I        'the colour is green'         1

The 0 and 1 can be obtained with difflib's SequenceMatcher, i.e. SequenceMatcher(None, s1, s2).ratio(), by applying a threshold to the ratio.
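For illustration, the thresholding idea could look roughly like the sketch below. The helper name is_similar and the threshold of 0.7 are assumptions made for this example, not part of the question.

```python
from difflib import SequenceMatcher


def is_similar(s1, s2, threshold=0.7):
    """Return 1 if the similarity ratio of s1 and s2 reaches the threshold, else 0.

    `is_similar` and the default threshold are hypothetical choices for this sketch.
    """
    ratio = SequenceMatcher(None, s1, s2).ratio()  # float between 0.0 and 1.0
    return int(ratio >= threshold)


print(is_similar("the value is zero", "the value is one"))
print(is_similar("empty sequence", "nothing is here"))
```

The first pair shares the long prefix "the value is " and so scores well above 0.7, while the second pair has almost nothing in common.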

But comparing every string with every other one in for loops takes a long time to execute. I am looking for a pandas shortcut or a Pythonic way to do this efficiently.

Solution

As in "Is it possible to do fuzzy match merge with python pandas?", we can use difflib here. difflib.get_close_matches returns the strings in the column whose similarity to a given string is at least cutoff. Every string also matches itself, so a string has a similar neighbour only when the returned list holds more than one entry:

import difflib

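# Flag a row with 1 when its string has a close match besides itself
# (every string matches itself, hence the "> 1")
df['col1'] = [int(len(difflib.get_close_matches(x, df['col2'], cutoff=0.7)) > 1)
              for x in df['col2']]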

print(df)

   col1                            col2
0     1             'the value is zero'
1     1                 'this is a cat'
2     1              'the value is one'
3     0               'nothing is here'
4     1            'the colour is blue'
5     1                   'this is dog'
6     0                'empty sequence'
7     1             'the colour is red'
8     1           'the colour is green'
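The list comprehension above scores every string against the whole column, so each pair is compared twice. If that becomes a bottleneck, one possible variation (a different technique from get_close_matches, sketched here under assumptions) is to visit each unordered pair once with itertools.combinations and skip pairs that are already flagged. The name flag_similar and the sample list are hypothetical. Note that SequenceMatcher.ratio() can depend on argument order, so results may differ slightly from get_close_matches in borderline cases.

```python
from difflib import SequenceMatcher
from itertools import combinations


def flag_similar(strings, threshold=0.7):
    """Return a list of 0/1 flags: 1 if a string has a similar partner in `strings`.

    Hypothetical helper for this sketch; each unordered pair is scored at most once.
    """
    flags = [0] * len(strings)
    for (i, a), (j, b) in combinations(enumerate(strings), 2):
        if flags[i] and flags[j]:
            continue  # both strings already have a partner, nothing new to learn
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flags[i] = flags[j] = 1
    return flags


strings = [
    "the colour is blue",
    "the colour is red",
    "the colour is green",
    "nothing is here",
    "empty sequence",
]
print(flag_similar(strings))  # [1, 1, 1, 0, 0]
```

With a data frame, the same function could be applied as df['col1'] = flag_similar(df['col2'].tolist()).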

Similarity matrix based on fuzzy matching

We may also want a similarity matrix in which a cell is 1 when the row string and the column string are similar. For this we can proceed as above, but keep the entire list of matches, explode it into one row per match, and pivot the resulting dataframe with pd.crosstab:

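import pandas as pd

# Keep the full list of close matches for every row
df['sim'] = [difflib.get_close_matches(x, df['col2'], cutoff=0.7) for x in df['col2']]
sim_df = df.explode('sim')  # one row per (string, match) pair
pd.crosstab(sim_df.col2, sim_df.sim)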

sim                  empty sequence  nothing is here  the colour is blue  ...  the value is zero  this is a cat  this is dog
col2
empty sequence                    1                0                   0  ...                  0              0            0
nothing is here                   0                1                   0  ...                  0              0            0
the colour is blue                0                0                   1  ...                  0              0            0
the colour is green               0                0                   1  ...                  0              0            0
the colour is red                 0                0                   1  ...                  0              0            0
the value is one                  0                0                   0  ...                  1              0            0
the value is zero                 0                0                   0  ...                  1              0            0
this is a cat                     0                0                   0  ...                  0              1            1
this is dog                       0                0                   0  ...                  0              1            1
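One caveat worth keeping in mind: get_close_matches returns at most n matches, and n defaults to 3, so a group of more than three similar strings would be truncated in the matrix. A minimal sketch of one way around this, using a plain list of strings and passing n=len(strings), is shown below. The function name similarity_matrix and the sample data are assumptions made for this example.

```python
import difflib


def similarity_matrix(strings, cutoff=0.7):
    """Build a 0/1 matrix (list of rows) marking which strings are close matches.

    Hypothetical helper for this sketch; n is raised to len(strings) so that
    no match is dropped by the default limit of n=3.
    """
    n = len(strings)
    matrix = []
    for s in strings:
        matches = set(difflib.get_close_matches(s, strings, n=n, cutoff=cutoff))
        matrix.append([int(t in matches) for t in strings])
    return matrix


strings = [
    "the colour is blue",
    "the colour is red",
    "the colour is green",
    "the colour is grey",
    "empty sequence",
]
for row in similarity_matrix(strings):
    print(row)
```

The nested list can be wrapped as pd.DataFrame(matrix, index=strings, columns=strings) to get a labelled table similar to the crosstab output.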