Tags: python, comparison, string-comparison, comparison-operators

Python "in" Comparison of Strings of Different Word Length


I am working through a database of names with possible duplicate entries and trying to identify which names appear twice. Unfortunately, the formatting is less than optimal: some entries have the first name, middle name, last name, or maiden name mashed into one string, while others have just the first and last name.

I need a way to see if, say, 'John Marvulli' matches 'John Michael Marvulli', and to be able to perform an operation on those matches. However, if you try:

>>> 'John Marvulli' in 'John Michael Marvulli'
False

It returns False. Is there an easy way to compare two strings in this manner to see if one name is contained in another?


Solution

  • I recently discovered the power of the difflib module.
    I think this will help you:

    import difflib

    datab = ['Pnk Flooyd', 'John Marvulli',
             'Ld Zeppelin', 'John Michael Marvulli',
             'Led Zepelin', 'Beetles', 'Pink Fl',
             'Beatlez', 'Beatles', 'Poonk LLoyds',
             'Pook Loyds']
    print(datab)
    print()

    li = []  # list of groups; each group is a list of similar strings
    s = difflib.SequenceMatcher()

    def yield_ratios(s, iterable):
        # Compare every string in iterable (as seq1) with the current seq2.
        for x in iterable:
            s.set_seq1(x)
            yield s.ratio()

    for text_item in datab:
        s.set_seq2(text_item)
        for gathered in li:
            # Join the first group that has a member similar enough to text_item.
            if any(r > 0.45 for r in yield_ratios(s, gathered)):
                gathered.append(text_item)
                break
        else:
            # No existing group matched, so start a new one.
            li.append([text_item])

    for el in li:
        print(el)


    Result:

    ['Pnk Flooyd', 'Pink Fl', 'Poonk LLoyds', 'Pook Loyds']
    ['John Marvulli', 'John Michael Marvulli']
    ['Ld Zeppelin', 'Led Zepelin']
    ['Beetles', 'Beatlez', 'Beatles']
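
If the names are spelled consistently and only differ in how many words they contain, a word-level comparison may be enough. The `in` operator tests for a contiguous substring, which is why 'John Marvulli' is not found in 'John Michael Marvulli'. The following is only a minimal sketch, not part of the difflib approach above: the helper names `name_tokens`, `is_contained`, and `find_contained_pairs` are made up for illustration, and it assumes names are separated by whitespace and that case and word order do not matter.

```python
from itertools import combinations


def name_tokens(name):
    """Lowercase a name and split it on whitespace into a set of words."""
    return set(name.lower().split())


def is_contained(short, long):
    """Return True if every word of `short` also appears in `long`."""
    short_tokens = name_tokens(short)
    # An empty name would trivially be a subset of anything, so reject it.
    return bool(short_tokens) and short_tokens <= name_tokens(long)


def find_contained_pairs(names):
    """Return pairs of names where one name's words are all in the other."""
    pairs = []
    for a, b in combinations(names, 2):
        if is_contained(a, b) or is_contained(b, a):
            pairs.append((a, b))
    return pairs


names = ['John Marvulli', 'Jane Doe', 'John Michael Marvulli']
for pair in find_contained_pairs(names):
    print(pair)
```

Unlike the difflib version, this does not tolerate misspellings, and it compares every pair of names, so it may be slow on a large database.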