Search code examples
pythonperformancelistpandassequencematcher

Finding closest approximate match between two sets of names


I have two sets of names of which I would like to find the closest match between the two, if no "close enough" match is found I would like to match the name to itself.

My current approach is to create a dataframe with all the possible combinations and use .apply or lists to iterate over and calculate a similarity ratio via SequenceMatcher (imported as sm).

The problem is I have a few thousand names in both lists which results in intractable run times.

My matching criteria ideally would be a sm ratio of >= 0.85 with the first name being found in the second name as a whole word. If these criteria are not met, the name should be matched to itself.

The last step I would like to implement is then to replace the original series with these matched names.

Here is the code for my current approach, please let me know if this is unclear an how I can help clarify:

stop_words = [
             'pharmaceuticals',
             'pharmaceutical',
             'pharma',
             'therapeutic',
             'biopharma',
             'therapeutics',
             'international',
             'biotechnology',
             'vaccines',
             '\&',
             '&',
             'co.',
             'co',
             'biotherapeutics',
             'biotherapeutic',
             'life science',
             'life sciences',
             'laboratories',
             'biologics',
             ]

temp_db_companies = db['Company']

def approximate_match(s1, s2):
    return str(sm(None, str(s1).lower().strip(), str(s2).lower().strip()).ratio()) + '|' + str(s2).strip()


def replace_val_w_best_match(df, col_initial, series_final, stop_words):
    init_names = df[col_initial].str.lower().str.split(" ", expand=True).replace(stop_words, "").fillna('')

    init_names = pd.Series(['' for n in range(len(init_names))]).str.cat([init_names[col] for col in init_names.columns], sep= " ").str.replace('  ', ' ').reset_index()

    matching_df = pd.DataFrame(columns = list(init_names.columns) + list(series_final), data = init_names)

    matching_df = pd.melt(matching_df,
                          id_vars = ['index', 0],
                          value_vars = list(series_final),
                          var_name = 'Comparators',
                          value_name = 'Best match')

#    matching =  matching_df.apply(lambda row: approximate_match(row[0], row['Comparators']), axis = 1)

    l = [(matching_df[0]), list(matching_df['Comparators'])]

    ratio = [sm(None, name1, name2) for name1 in l[0] for name2 in l[1]]

    match = [name2 for name1 in l[0] for name2 in l[1]]

    print(ratio[:5])
    print(match[:5])

Solution

  • What you're probably looking for is the Levenshtein distance algorithm. It computes the minimum number of edits needed to transform one string into the other.

    Check out this library: https://github.com/ztane/python-Levenshtein/

    The Levenshtein library has a class called StringMatcher.py designed to help you with this problem.

    This library also includes a similar functionality: https://github.com/gfairchild/pyxDamerauLevenshtein