I am trying to compare texts scraped from different websites with each other. I have a list of texts taken from a column in a dataframe. To compare the texts in this list, I tried using similarity measures (I do not know whether there is a better way to do this). This is the code:
from difflib import SequenceMatcher

titles = filtered_dataset['Titles'].tolist()

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_jaccard_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

similarities=[]
j_similarities=[]

for title in titles:
    similarity=similar(title, title+1)
    jacc_similarity=get_jaccard_sim(title, title+1) # I would like to compare the first text to the others; then the second one, and so on...
I get the following error:
TypeError: can only concatenate str (not "int") to str
It is raised by these two lines:
similarity=similar(title, title+1)
jacc_similarity=get_jaccard_sim(title, title+1)
Could you please help me fix the error so that I can compare the texts?
You are adding title (a string) and 1 (an int), but in Python you cannot add a string and an integer. If you want to concatenate them, convert the integer to a string first with str(). For example, "sampleString"+str(1) gives "sampleString1", because str() changes 1 to '1'. Both operands are then strings (type("sampleString") and type(str(1)) are both str), so they can be added together.
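Use this code:
similarity=similar(title, title+str(1))
jacc_similarity=get_jaccard_sim(title, title+str(1))
This removes the TypeError, but note that title+str(1) is just the same title with "1" appended, so it does not refer to the next title in the list.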
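To compare each title with the others, as the comment in the question asks, one possible approach is to iterate over pairs of list positions instead of modifying the title string. The following is only a sketch under some assumptions: the sample titles are made up and stand in for filtered_dataset['Titles'].tolist(), the results list and its tuple layout are hypothetical, and the guard against an empty union in get_jaccard_sim is an addition that is not in the original function.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    # Character-based similarity ratio in the range [0, 1]
    return SequenceMatcher(None, a, b).ratio()


def get_jaccard_sim(str1, str2):
    # Word-based Jaccard similarity: |intersection| / |union|
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Assumption: two empty titles are given a similarity of 0.0
    # instead of raising ZeroDivisionError
    return len(c) / union_size if union_size else 0.0


# Hypothetical sample data standing in for filtered_dataset['Titles'].tolist()
titles = [
    "python string similarity",
    "string similarity in python",
    "cooking pasta",
]

# Each entry is (index of first title, index of second title,
# SequenceMatcher ratio, Jaccard similarity)
results = []
for i, j in combinations(range(len(titles)), 2):
    results.append(
        (i, j, similar(titles[i], titles[j]), get_jaccard_sim(titles[i], titles[j]))
    )

for i, j, ratio, jaccard in results:
    print(f"{titles[i]!r} vs {titles[j]!r}: ratio={ratio:.2f}, jaccard={jaccard:.2f}")
```

combinations(range(len(titles)), 2) yields every unordered pair of positions exactly once, so the first title is compared with all later ones, then the second, and so on. The number of pairs grows quadratically with the length of the list, which may matter for a large dataframe.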
thank you.