I need some help with Python. This is not the classic "subtract List B from List A to make List C" problem. Instead, I would like to find the indexes of the items in List A (single-word city names) that are not in List B, and store those indexes in a new List C. Also, the items in List B are not exactly the same as those in List A: they come from OCR, so they are slightly misspelled. I would like to count two items as a match if they are at least 90% similar.
e.g.
List A: #all list items are single-word city names
0. Corneria
1. klandasco
2. Blue_Mars
3. Setiro
4. Jeti_lo
5. Neo_Tokyo
List B: #city names are slightly misspelled
0. lcandasco
1. Ne0_Tolcyo
So, the result should be...
List C:
[0, 2, 3, 4]
The resulting items themselves (Corneria, Blue_Mars, Setiro, Jeti_lo) are not important. What I need is the original indexes of those items in List A once the subtraction has been made.
So far I'm doing this:
a = ["aaa", "bbb", "ccc", "ddd", "ccc", "eee"]
b = ["bbb", "eee"]
c = [i for i, v in enumerate(a) if v not in b]
print(c)
output...
[0, 2, 3, 4]
But I need to add the difflib part so that items count as a match at 90% similarity. How could I do this in pure Python, preferably using only difflib?
How about this:
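from difflib import SequenceMatcher

# Similarity threshold: ratio() returns a float between 0 and 1.
max_ratio = 0.9
# Keep index i when no item in b is at least 90% similar to a[i].
c = [i for i, v in enumerate(a)
     if not any(SequenceMatcher(None, v, x).ratio() >= max_ratio for x in b)]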
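To see how the threshold behaves on the sample data from the question, here is a small sketch. The function name unmatched_indexes and its threshold parameter are my own, not part of difflib. One thing to be aware of: SequenceMatcher scores "klandasco" vs "lcandasco" at about 0.89 and "Neo_Tokyo" vs "Ne0_Tolcyo" at about 0.74, so a strict 0.9 cutoff treats both as unmatched. For these particular strings, a lower cutoff such as 0.7 gives the expected result.

```python
from difflib import SequenceMatcher

def unmatched_indexes(a, b, threshold):
    """Return indexes of items in a with no item in b scoring >= threshold."""
    return [
        i for i, v in enumerate(a)
        if not any(SequenceMatcher(None, v, x).ratio() >= threshold for x in b)
    ]

# Sample data copied from the question.
list_a = ["Corneria", "klandasco", "Blue_Mars", "Setiro", "Jeti_lo", "Neo_Tokyo"]
list_b = ["lcandasco", "Ne0_Tolcyo"]

print(unmatched_indexes(list_a, list_b, 0.9))  # [0, 1, 2, 3, 4, 5]
print(unmatched_indexes(list_a, list_b, 0.7))  # [0, 2, 3, 4]
```

The right cutoff depends on how noisy the OCR output is, so it is probably worth printing the ratios for a handful of real pairs before settling on a value.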
A snippet that uses fuzzywuzzy:
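from fuzzywuzzy import fuzz

# Similarity threshold: fuzz.ratio() returns an integer score from 0 to 100.
max_ratio = 90
# Keep index i when no item in b scores at least 90 against a[i].
c = [i for i, v in enumerate(a)
     if not any(fuzz.ratio(v, x) >= max_ratio for x in b)]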
Note: fuzzywuzzy is a third-party package, so you need to install it before using it (for example with pip install fuzzywuzzy).