I have a small problem: I am trying to compare two lists of words to compute a similarity percentage, but if the same word appears twice in each list, I get a wrong percentage.
First I wrote this little script:
data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']
res = 0
nb = (len(data1) + len(data2)) / 2
if data1 and data2 and nb != 0:
    for id1, item1 in enumerate(data1):
        for id2, item2 in enumerate(data2):
            if item1 == item2:
                res += 1 - abs(id1 - id2) / nb
    print(res / nb * 100)
The problem is that if the same word appears twice in the lists, the percentage comes out greater than 100%. To counter that, I added a 'break' just after the line 'res += 1 - abs(id1 - id2) / nb', but the percentage is still wrong.
I hope I have explained my problem clearly. Thank you for your help!
You can use difflib.SequenceMatcher instead to compare the similarity of the two lists. Try this:
from difflib import SequenceMatcher as sm
data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']
matching_percentage = sm(None, data1, data2).ratio() * 100
print(matching_percentage)
Output:
100.0
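As for why the original loop misbehaves: with 'test' at indexes 0 and 3 in both lists, the nested loops also pair index 0 with index 3 and index 3 with index 0, so the word is counted four times instead of twice. Adding 'break' stops at the first match, so the second 'test' is paired with index 0 instead of index 3 and gets penalised for distance. The sketch below reproduces both cases and compares them with SequenceMatcher.ratio(), which the difflib documentation defines as 2.0 * M / T (M = number of matches, T = total number of elements in both sequences). The helper names positional_score and similarity_percentage, the stop_at_first_match flag, and the list data3 are made up for illustration; they are not part of the question or of difflib.

```python
from difflib import SequenceMatcher


def positional_score(a, b, stop_at_first_match=False):
    """Reproduce the loop from the question (hypothetical helper name)."""
    nb = (len(a) + len(b)) / 2
    if nb == 0:
        return 0.0  # assumption: treat two empty lists as 0% here
    res = 0.0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                res += 1 - abs(i - j) / nb
                if stop_at_first_match:
                    break  # the 'break' tried in the question
    return res / nb * 100


def similarity_percentage(a, b):
    """Percentage based on SequenceMatcher.ratio(), i.e. 2 * M / T * 100."""
    return SequenceMatcher(None, a, b).ratio() * 100


data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']
data3 = ['test', 'super', 'test', 'boom']  # made-up list for illustration

# Every (i, j) pair of equal words is counted, including the cross pairs
# (0, 3) and (3, 0): (5 + 2 * 0.4) / 5 * 100
print(round(positional_score(data1, data2), 2))  # 116.0

# With 'break', the second 'test' is matched against index 0:
# (1 + 1 + 1 + 0.4 + 1) / 5 * 100
print(round(positional_score(data1, data2, stop_at_first_match=True), 2))  # 88.0

# SequenceMatcher matches each element at most once
print(round(similarity_percentage(data1, data2), 2))  # 100.0
print(round(similarity_percentage(data1, data3), 2))  # 88.89 (2 * 4 / 9 * 100)
```

Note that ratio() measures how much of the two sequences lines up in order, so it does not apply the same per-word distance penalty as the original loop; if that penalty matters, the loop would need to keep track of which indexes in the second list have already been matched.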