I would like to compare items from two lists (please see below). I am looking for a similarity score between the items. For example, b_list contains http://www.ilcorrieredellanotte.it, which is similar to Corriere della Sera from g_list. An expected output would be:
(ilcorrieredellanotte, corrieredellasera) = (score of similarity)
Also, https://www.ilmattoquotidiano.it, http://www.ilfattoquotidaino.it, and https://ilquotidaino.wordpress.com from b_list are similar to il fatto quotidiano from g_list. An example of output would be:
(ilmattoquotidiano, ilfattoquotidiano) = 90 (they differ by a single letter, 'm' instead of 'f')
(ilfattoquotidaino, ilfattoquotidiano) = 95 (they differ only in two adjacent vowels that are swapped)
(ilquotidaino, ilfattoquotidiano) = 60 (it is missing 'fatto')
(The scores 90, 95, and 60 are just examples.)
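I was thinking of using
# Assumption: process comes from fuzzywuzzy; the post does not name the library
from fuzzywuzzy import process
Ratios = [process.extract(x, g_list) for x in b_list]
result = list()
for ratio in Ratios:
    for match in ratio:
        # Keep the first candidate that is not a perfect (100) match
        if match[1] != 100:
            result.append(match)
            break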
but the output is giving me something different (for example, "Il fatto quotidiano" from the list is not included). I think it is because I am comparing a list of URLs with words separated by spaces, and because the comparison is case sensitive.
Any suggestion would be greatly appreciated. Thanks
Lists:
import re
b_list = ["http://notiziepericolose.blogspot.com", "http://www.ilcorrieredellanotte.it", "https://www.ilmattoquotidiano.it", "http://ioco.altervista.org/blog/", "http://www.ilmessaggio.it", "http://www.ilcorriere.cloud", "http://www.ilfattoquotidaino.it", "https://ilquotidaino.wordpress.com", "http://www.liberogiornale.com"]
# Strip the scheme and an optional "www." prefix from each URL
b_list = [re.sub(r"https?://(www\.)?", "", a) for a in b_list]
g_list = ["Corriere della Sera", "la Repubblica", "La Gazzetta dello Sport", "Corriere dello Sport-Stadio", "Italia Oggi", "il Giornale", "Tuttosport", "il Fatto Quotidiano", "Il Mattino", "Libero", "Leggo"]
# Lowercase the names so the comparison is case-insensitive
g_list = [x.lower() for x in g_list]
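The suspicion above (URLs versus space-separated, mixed-case names) suggests bringing both lists into the same shape before comparing. The following is only a minimal sketch of that idea; the helper names normalize_url and normalize_name are hypothetical, and keeping just the first label of the host is an assumption that happens to suit these sample URLs.

```python
import re

def normalize_url(url):
    """Reduce a URL to a bare site name, e.g. 'ilfattoquotidaino'."""
    # Drop the scheme and an optional "www." prefix
    host = re.sub(r"^https?://(www\.)?", "", url.lower())
    host = host.split("/")[0]      # drop any path such as /blog/
    return host.split(".")[0]      # keep only the first label of the host

def normalize_name(name):
    """Lowercase a newspaper name and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

print(normalize_url("http://www.ilcorrieredellanotte.it"))
print(normalize_name("il Fatto Quotidiano"))
```

After this step both sides are lowercase strings without spaces or domain suffixes, so a character-based similarity measure compares like with like.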
Here is a simple difflib example. It's easy to adjust the cutoff parameter for more or less sensitivity as needed.
import difflib
b_list =["http://notiziepericolose.blogspot.com",
"http://www.ilcorrieredellanotte.it",
"https://www.ilmattoquotidiano.it",
"http://ioco.altervista.org/blog/",
"http://www.ilmessaggio.it",
"http://www.ilcorriere.cloud",
"http://www.ilfattoquotidaino.it",
"https://ilquotidaino.wordpress.com",
"http://www.liberogiornale.com", ]
g_list=["Corriere della Sera",
"la Repubblica",
"La Gazzetta dello Sport",
"Corriere dello Sport-Stadio",
"Italia Oggi",
"il Giornale",
"Tuttosport",
"il Fatto Quotidiano",
"Il Mattino",
"Libero",
"Leggo"]
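save_dict = {}
save_list = []
for g in g_list:
    # cutoff is the minimum similarity ratio (0 to 1); lower values allow looser matches
    matches_list = difflib.get_close_matches(g, possibilities=b_list, cutoff=0.35)
    print(g, matches_list)
    # Keep only the names that matched at least one URL
    if len(matches_list) > 0:
        save_dict[g] = matches_list
        save_list.append([g, matches_list])
print(save_dict)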
{'Corriere della Sera': ['http://www.ilcorrieredellanotte.it'],
'Corriere dello Sport-Stadio': ['http://www.ilcorrieredellanotte.it',
'http://www.ilcorriere.cloud'],
'il Giornale': ['http://www.liberogiornale.com'],
'il Fatto Quotidiano': ['https://www.ilmattoquotidiano.it',
'http://www.ilfattoquotidaino.it',
'https://ilquotidaino.wordpress.com']}
print(save_list)
[['Corriere della Sera', ['http://www.ilcorrieredellanotte.it']],
['Corriere dello Sport-Stadio',
['http://www.ilcorrieredellanotte.it', 'http://www.ilcorriere.cloud']],
['il Giornale', ['http://www.liberogiornale.com']],
['il Fatto Quotidiano',
['https://www.ilmattoquotidiano.it',
'http://www.ilfattoquotidaino.it',
'https://ilquotidaino.wordpress.com']]]
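get_close_matches returns only the matching strings, not the scores the question asks for. One possible way to get a number per pair is difflib.SequenceMatcher.ratio(), scaled to 0-100. This is a sketch under assumptions, not part of the answer above: the helpers site_name, compact, similarity and score_pairs, and the min_score threshold of 60, are made up for illustration.

```python
import difflib
import re

def site_name(url):
    """Strip scheme, 'www.', path and domain suffix from a URL."""
    host = re.sub(r"^https?://(www\.)?", "", url.lower())
    return host.split("/")[0].split(".")[0]

def compact(name):
    """Lowercase a name and remove spaces and punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def similarity(a, b):
    """Return a 0-100 score based on difflib.SequenceMatcher.ratio()."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def score_pairs(urls, names, min_score=60):
    """Score every (url, name) pair and keep those at or above min_score."""
    pairs = []
    for url in urls:
        site = site_name(url)
        for name in names:
            target = compact(name)
            score = similarity(site, target)
            if score >= min_score:
                pairs.append((site, target, score))
    # Highest scores first
    return sorted(pairs, key=lambda p: p[2], reverse=True)

sample_urls = ["https://www.ilmattoquotidiano.it", "http://www.ilfattoquotidaino.it"]
sample_names = ["il Fatto Quotidiano", "Leggo"]
for site, target, score in score_pairs(sample_urls, sample_names):
    print(f"({site}, {target}) = {score}")
```

The printed lines follow the (site, name) = score shape requested in the question. The actual numbers depend on SequenceMatcher's matching-block heuristic, so they will not necessarily equal the 90/95/60 used as examples there.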