python similarity fuzzywuzzy

Compare items from lists and find similarity


I would like to compare items from two lists (see below) and score how similar they are. For example, b_list contains this item:

http://www.ilcorrieredellanotte.it

which is similar to Corriere della Sera from g_list. An expected output would be:

(ilcorrieredellanotte, corrieredellasera) = (score of similarity)

Also: https://www.ilmattoquotidiano.it, http://www.ilfattoquotidaino.it, and https://ilquotidaino.wordpress.com from b_list are similar to il fatto quotidiano from g_list. An example of output would be:

(ilmattoquotidiano, ilfattoquotidiano) = 90 (they differ by a single letter, 'm' instead of 'f')

(ilfattoquotidaino, ilfattoquotidiano) = 95 (they differ only by two adjacent vowels that are swapped)

(ilquotidaino, ilfattoquotidiano) = 60 (it is missing 'fatto')

(The scores 90, 95, and 60 are only examples.)

I was thinking of using

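from fuzzywuzzy import process

# For each b_list item, get its best candidates from g_list as (match, score) tuples
Ratios = [process.extract(x, g_list) for x in b_list]
result = list()
for ratio in Ratios:
    for match in ratio:
        # Keep only the first candidate that is not an exact (score 100) match
        if match[1] != 100:
            result.append(match)
            break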

but the output is not what I expected (for example, "Il fatto quotidiano" is not included in the results). I think this is because I am comparing a list of URLs against titles whose words are separated by spaces, and because the comparison is case-sensitive. Any suggestions would be greatly appreciated. Thanks.

Lists:

import re

b_list =["http://notiziepericolose.blogspot.com","http://www.ilcorrieredellanotte.it","https://www.ilmattoquotidiano.it","http://ioco.altervista.org/blog/","http://www.ilmessaggio.it","http://www.ilcorriere.cloud","http://www.ilfattoquotidaino.it","https://ilquotidaino.wordpress.com","http://www.liberogiornale.com", ]
# Strip the scheme and an optional "www." prefix from every URL
b_list=[re.sub(r"https?://(www\.)?", r'', a) for a in b_list]

g_list=["Corriere della Sera","la Repubblica","La Gazzetta dello Sport","Corriere dello Sport-Stadio","Italia Oggi","il Giornale","Tuttosport","il Fatto Quotidiano","Il Mattino","Libero","Leggo"]
g_list =[x.lower() for x in g_list]

Solution

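  • Here is a simple example using difflib.get_close_matches. The cutoff parameter is a similarity threshold between 0 and 1: lower it to get more, looser matches, or raise it to get fewer, stricter ones. By default at most three matches are returned per item.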

    import difflib
    
    b_list =["http://notiziepericolose.blogspot.com",
             "http://www.ilcorrieredellanotte.it",
             "https://www.ilmattoquotidiano.it",
             "http://ioco.altervista.org/blog/",
             "http://www.ilmessaggio.it",
             "http://www.ilcorriere.cloud",
             "http://www.ilfattoquotidaino.it",
             "https://ilquotidaino.wordpress.com",
             "http://www.liberogiornale.com", ]
    
    g_list=["Corriere della Sera",
            "la Repubblica",
            "La Gazzetta dello Sport",
            "Corriere dello Sport-Stadio",
            "Italia Oggi",
            "il Giornale",
            "Tuttosport",
            "il Fatto Quotidiano",
            "Il Mattino",
            "Libero",
            "Leggo"]
    
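    # Collect the matches both as a dict and as a list of [title, matches] pairs
    save_dict = {}
    save_list = []
    
    for g in g_list:
        # URLs in b_list whose similarity to the title g is at least the cutoff
        matches_list = difflib.get_close_matches(g, possibilities=b_list, cutoff=0.35)
        print(g, matches_list)
    
        # Keep only titles that matched at least one URL
        if len(matches_list) > 0:
            save_dict[g] = matches_list
            save_list.append([g, matches_list])
    
    print(save_dict)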
    
    {'Corriere della Sera': ['http://www.ilcorrieredellanotte.it'],
     'Corriere dello Sport-Stadio': ['http://www.ilcorrieredellanotte.it',
      'http://www.ilcorriere.cloud'],
     'il Giornale': ['http://www.liberogiornale.com'],
     'il Fatto Quotidiano': ['https://www.ilmattoquotidiano.it',
      'http://www.ilfattoquotidaino.it',
      'https://ilquotidaino.wordpress.com']}
    
    print(save_list)
    
    [['Corriere della Sera', ['http://www.ilcorrieredellanotte.it']],
     ['Corriere dello Sport-Stadio',
      ['http://www.ilcorrieredellanotte.it', 'http://www.ilcorriere.cloud']],
     ['il Giornale', ['http://www.liberogiornale.com']],
     ['il Fatto Quotidiano',
      ['https://www.ilmattoquotidiano.it',
       'http://www.ilfattoquotidaino.it',
       'https://ilquotidaino.wordpress.com']]]
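
The question also asks for a numeric score per pair, which get_close_matches does not return. Below is a minimal sketch of one way to get one, assuming it is acceptable to reduce each URL to the first label of its host name and to strip spaces and punctuation from each title before comparing. The helper names normalize_url, normalize_title, and best_match are made up for this example. The score is difflib.SequenceMatcher.ratio() scaled to 0-100; it is not fuzzywuzzy's scorer, so the numbers need not agree with process.extract.

```python
import difflib
import re


def normalize_url(url):
    """Reduce a URL to the first label of its host name, lowercased.

    Assumption: the part before the first dot is the name to compare,
    e.g. "https://ilquotidaino.wordpress.com" becomes "ilquotidaino".
    """
    host = re.sub(r"^https?://(www\.)?", "", url.lower())
    return host.split("/")[0].split(".")[0]


def normalize_title(title):
    """Lowercase a title and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def best_match(url, titles):
    """Return (url_key, title_key, score) for the most similar title.

    The score is SequenceMatcher.ratio() scaled to 0-100 and rounded.
    """
    key = normalize_url(url)
    scored = []
    for title in titles:
        title_key = normalize_title(title)
        ratio = difflib.SequenceMatcher(None, key, title_key).ratio()
        scored.append((ratio, title_key))
    ratio, title_key = max(scored)
    return key, title_key, round(ratio * 100)


b_list = ["http://www.ilcorrieredellanotte.it",
          "https://www.ilmattoquotidiano.it",
          "http://www.ilfattoquotidaino.it",
          "https://ilquotidaino.wordpress.com"]

g_list = ["Corriere della Sera", "la Repubblica", "La Gazzetta dello Sport",
          "Corriere dello Sport-Stadio", "Italia Oggi", "il Giornale",
          "Tuttosport", "il Fatto Quotidiano", "Il Mattino", "Libero", "Leggo"]

for b in b_list:
    url_key, title_key, score = best_match(b, g_list)
    print(f"({url_key}, {title_key}) = {score}")
```

Because both sides are normalized to the same form first, spaces and letter case no longer lower the score. To reject weak matches, compare the returned score against a threshold of your choice, in the same way as the cutoff above.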