python string match subtraction difflib

Subtract List B from List A, but keeping the List A index and using difflib string similarity


I need some help with Python. This is not the classic "subtract List B from List A to make List C" problem. Instead, I would like to find the indexes of the items in List A (single-word city names) that are not in List B, and store those indexes in a new List C. Also, the items in List B are not exactly the same as the ones in List A: they come from OCR, so they are slightly misspelled. I would like to count two items as a match if they are at least 90% similar.

e.g.

List A: # every item is a single-word city name

0. Corneria
1. klandasco
2. Blue_Mars
3. Setiro
4. Jeti_lo
5. Neo_Tokyo

List B: # city names are slightly misspelled

0. lcandasco
1. Ne0_Tolcyo

So, the result should be...

List C:

[0, 2, 3, 4]

The resulting items themselves (Corneria, Blue_Mars, Setiro, Jeti_lo) are not important. What I need is the original indexes of the items in List A that remain once the subtraction has been made.

So far I'm doing this:

a = ["aaa", "bbb", "ccc", "ddd", "ccc", "eee"]
b = ["bbb", "eee"]
c = [i for i, v in enumerate(a) if v not in b]
print(c)

Output:

[0, 2, 3, 4]

But I need to add the difflib part so that items with at least 90% similarity count as a match. How could I do this in pure Python, preferably using only difflib?


Solution

  • How about this:

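    from difflib import SequenceMatcher
    
    # Minimum similarity (0.0 to 1.0) for two strings to count as a match.
    min_ratio = 0.9
    
    # a and b are the two lists from the question.
    # Keep the index of every item in a that has no sufficiently similar item in b.
    c = [i for i, v in enumerate(a)
         if not any(SequenceMatcher(None, v, x).ratio() >= min_ratio for x in b)]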
    

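    The same idea using fuzzywuzzy:

    from fuzzywuzzy import fuzz
    
    # fuzz.ratio returns an integer score from 0 to 100, so the threshold is 90 here.
    min_ratio = 90
    
    # Keep the index of every item in a that has no sufficiently similar item in b.
    c = [i for i, v in enumerate(a)
         if not any(fuzz.ratio(v, x) >= min_ratio for x in b)]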
    

    Note: fuzzywuzzy is a third-party package, so it must be installed before use (for example with pip install fuzzywuzzy).
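
To try the difflib approach on the city names from the question, here is a minimal sketch. The function name `unmatched_indexes` and the variable names `list_a` and `list_b` are my own, not from the original post. It also prints the similarity of the two OCR'd pairs, because the threshold matters: by my calculation `SequenceMatcher` scores them at roughly 0.89 and 0.74, so a strict 0.9 cutoff would leave both unmatched, while something around 0.7 gives the [0, 2, 3, 4] the question expects. Treat the cutoff as something to tune on your own data.

```python
from difflib import SequenceMatcher


def unmatched_indexes(a, b, min_ratio=0.9):
    """Return the indexes of items in `a` that have no item in `b`
    with a SequenceMatcher ratio of at least `min_ratio`.

    Hypothetical helper name, used here only for illustration.
    """
    return [
        i for i, v in enumerate(a)
        if not any(SequenceMatcher(None, v, x).ratio() >= min_ratio for x in b)
    ]


# Sample data copied from the question.
list_a = ["Corneria", "klandasco", "Blue_Mars", "Setiro", "Jeti_lo", "Neo_Tokyo"]
list_b = ["lcandasco", "Ne0_Tolcyo"]

# Inspect the similarity of the two pairs that are supposed to match,
# to help pick a sensible threshold.
for v, x in [("klandasco", "lcandasco"), ("Neo_Tokyo", "Ne0_Tolcyo")]:
    print(v, x, round(SequenceMatcher(None, v, x).ratio(), 3))

# Compare a strict and a looser threshold on the same data.
print(unmatched_indexes(list_a, list_b, min_ratio=0.9))
print(unmatched_indexes(list_a, list_b, min_ratio=0.7))
```

Note that the comparison is case-sensitive. If the OCR output can differ in case, lowercasing both strings before comparing is one option.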