python string nlp similarity fuzzy-comparison

Abbreviation Detection for Python


I am trying to measure the similarity of company names, but I am having difficulty matching abbreviations of those names. For example:

IBM
The International Business Machines Corporation

I have tried using fuzzywuzzy to measure the similarity:

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio("IBM","The International Business Machines Corporation")
33
>>> fuzz.partial_ratio("General Electric","GE Company")
20
>>> fuzz.partial_ratio("LTCG Holdings Corp","Long Term Care Group Inc")
39
>>> fuzz.partial_ratio("Young Innovations Inc","YI LLC")
33

Do you know of any techniques that would give a higher similarity score for abbreviations like these?


Solution

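  • Building an acronym from the first letter of each word in the full company name, then fuzzy-matching that acronym against the abbreviations, seems to produce much better results for the set of examples above: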

    from fuzzywuzzy import fuzz, process

    companies = ['The International Business Machines Corporation','General Electric','Long Term Care Group','Young Innovations Inc']
    abbreviations = ['YI LLC','LTCG Holdings Corp','IBM','GE Company']

    # Build an acronym for each company from the first letter of every word,
    # e.g. 'General Electric' -> 'GE'.
    queries = [''.join(word[0] for word in name.split()) for name in companies]

    # Rank all abbreviations against each acronym.
    for query in queries:
        print(query, process.extract(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))
    

    This yields:

    TIBMC [('IBM', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 29), ('GE Company', 20)]
    GE [('GE Company', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 0), ('IBM', 0)]
    LTCG [('LTCG Holdings Corp', 100), ('YI LLC', 50), ('GE Company', 25), ('IBM', 0)]
    YII [('YI LLC', 80), ('LTCG Holdings Corp', 33), ('IBM', 33), ('GE Company', 33)]
    

    A small modification to the for loop prints each full company name next to its single best match:

    for query, company in zip(queries, companies):
        print(company, '-', process.extractOne(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))
    

    This gives:

    The International Business Machines Corporation - ('IBM', 100)
    General Electric - ('GE Company', 100)
    Long Term Care Group - ('LTCG Holdings Corp', 100)
    Young Innovations Inc - ('YI LLC', 80)
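In the output above, 'Young Innovations Inc' becomes the query YII because "Inc" contributes a letter, which is why it scores 80 rather than 100. One possible refinement, sketched below using only the standard library, is to drop legal suffixes and articles before building the acronym. The `NOISE_WORDS` list and the helper names `significant_words`, `initials`, and `abbreviation_score` are hypothetical and not part of fuzzywuzzy; `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorer here, so the scores are not directly comparable to the ones above.

```python
from difflib import SequenceMatcher

# Hypothetical noise-word list (an assumption); extend it for your own data.
NOISE_WORDS = {"the", "inc", "corp", "corporation", "company", "co",
               "llc", "ltd", "holdings"}


def significant_words(name):
    """Return the words of `name` with trailing punctuation stripped
    and noise words (articles, legal suffixes) removed."""
    words = [w.strip(".,") for w in name.split()]
    return [w for w in words if w and w.lower() not in NOISE_WORDS]


def initials(name):
    """Build an acronym from the first letter of each significant word."""
    return "".join(w[0].upper() for w in significant_words(name))


def abbreviation_score(full_name, candidate):
    """Score from 0 to 100 for how closely `candidate` matches the
    acronym of `full_name`, ignoring noise words on both sides."""
    acronym = initials(full_name)
    # Join the candidate's significant words, e.g. 'YI LLC' -> 'YI'.
    short = "".join(significant_words(candidate)).upper()
    if not acronym or not short:
        return 0
    return round(100 * SequenceMatcher(None, acronym, short).ratio())


if __name__ == "__main__":
    pairs = [
        ("The International Business Machines Corporation", "IBM"),
        ("General Electric", "GE Company"),
        ("Long Term Care Group Inc", "LTCG Holdings Corp"),
        ("Young Innovations Inc", "YI LLC"),
    ]
    for full_name, candidate in pairs:
        print(full_name, "-", candidate, abbreviation_score(full_name, candidate))
```

With this noise-word list, all four pairs from the question score 100, since each acronym equals the cleaned abbreviation exactly. The trade-off is that the list has to be maintained by hand, and a company whose real name contains one of those words would lose a letter from its acronym.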