I am working through a database of names that may contain duplicate entries, and I am trying to identify which names appear twice. Unfortunately, the formatting is less than optimal: some entries have their first, middle, last, or maiden names mashed into one string, and some have just first and last.
I need a way to see if, say, 'John Marvulli' matches 'John Michael Marvulli', and to be able to do an operation on those matches. However, if you try:
>>> 'John Marvulli' in 'John Michael Marvulli'
False
It returns False. Is there an easy way to compare two strings in this manner to see if one name is contained in another?
I recently discovered the power of the difflib module.
I think this will help you:
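import difflib

datab = ['Pnk Flooyd', 'John Marvulli',
         'Ld Zeppelin', 'John Michael Marvulli',
         'Led Zepelin', 'Beetles', 'Pink Fl',
         'Beatlez', 'Beatles', 'Poonk LLoyds',
         'Pook Loyds']
print(datab)
print()

li = []  # list of groups; each group is a list of similar strings
s = difflib.SequenceMatcher()

def yield_ratios(s, iterable):
    # Compare each string in iterable (as seq1) with the current seq2
    for x in iterable:
        s.set_seq1(x)
        yield s.ratio()

for text_item in datab:
    s.set_seq2(text_item)
    for gathered in li:
        # Join the first group that has a sufficiently similar member
        if any(r > 0.45 for r in yield_ratios(s, gathered)):
            gathered.append(text_item)
            break
    else:
        # No group matched (the loop ended without break): start a new group
        li.append([text_item])

for el in li:
    print(el)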
Result:
['Pnk Flooyd', 'Pink Fl', 'Poonk LLoyds', 'Pook Loyds']
['John Marvulli', 'John Michael Marvulli']
['Ld Zeppelin', 'Led Zepelin']
['Beetles', 'Beatlez', 'Beatles']
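If the names are spelled consistently and the only difference is a missing middle or maiden name, a stricter check may be enough: split each name into words and test whether one set of words is a subset of the other. This is only a minimal sketch under that assumption. The helper names name_tokens and is_contained_name are made up for illustration, the comparison ignores case and word order, and it will not catch typos the way the difflib ratio does.

```python
def name_tokens(name):
    """Return the lowercased set of whitespace-separated words in a name."""
    return set(name.lower().split())


def is_contained_name(short, full):
    """True if every word of `short` also appears as a word in `full`."""
    short_tokens = name_tokens(short)
    # An empty name should not count as contained in everything
    return bool(short_tokens) and short_tokens <= name_tokens(full)


print(is_contained_name('John Marvulli', 'John Michael Marvulli'))
print(is_contained_name('John Marvulli', 'Jane Marvulli'))
```

The first call should report a match and the second should not, since 'john' is not one of the words in 'Jane Marvulli'. You could run this check first and fall back to the difflib grouping for entries that may be misspelled.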