
How to work with gigantic list with jaro-winkler similarity check


I have a list of more than 10 million strings that I need to iterate over, scoring each one as a percentage with a similarity function. Each string in the giant list is compared against the items of a second, smaller list, as follows:

import jaro  # provided by the jaro-winkler package

drugs = ['amoxil', 'acyclovir', 'univir', ...]
texts = [text1, text2, text3, ...]  # about 50 to 100 query strings

def score_all(texts, drugs):
    similarities = []
    for item in drugs:
        # Score every query text against the current drug name, as a percentage.
        for text in texts:
            similarities.append(jaro.jaro_winkler_metric(text, item) * 100)
    return similarities

There are about 50 to 100 texts (text1, text2, etc.). The code works well and runs fast when the drug list has 10 or so items. The more items I add, the slower and more problematic it becomes, and with 500k items it can freeze the laptop. I have more than 10 million items to put in the drug list. How can I make this faster without crashing the system? Regards


Solution

  • You might want to take a look at batch_jaro_winkler. I created it for use cases like this one, where you want maximum performance. You build a model once and can then reuse it for any number of runtime calculations. Pass either your drugs or your texts to build_exportable_model, whichever is the larger list.

    import batch_jaro_winkler as bjw
    
    drugs = ['amoxil', 'acyclovir', 'univir', ...]
    exportable_model = bjw.build_exportable_model(drugs)
    runtime_model = bjw.build_runtime_model(exportable_model)
    for text in ['text1', 'text2', 'text3']:
      similarities = bjw.jaro_winkler_distance(runtime_model, text)
      # similarities = [('amoxil', 0.0), ('acyclovir', 0.5), ('univir', 0.96)]
    

    If you only care about the best results and/or results with at least a certain score, I highly recommend passing min_score and n_best_results as arguments to bjw.jaro_winkler_distance.
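
To show why a score cut-off and a cap on the number of results help with memory, here is a minimal library-free sketch. It is not how batch_jaro_winkler works internally: `jaro`, `jaro_winkler`, and `best_matches` are hypothetical helper names, and the plain-Python Jaro-Winkler below is only a slow stand-in (it always applies the prefix bonus, whereas some implementations apply it only above a threshold). The point is that each query keeps at most `n_best` pairs instead of one score per drug.

```python
import heapq


def jaro(s1, s2):
    """Plain-Python Jaro similarity in [0, 1] (illustrative stand-in)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro score boosted by the length of the common prefix (max 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def best_matches(text, candidates, min_score=0.8, n_best=3):
    """Return up to n_best (candidate, score) pairs scoring >= min_score."""
    # Generators keep memory flat: scores are never stored for every candidate.
    scored = ((c, jaro_winkler(text, c)) for c in candidates)
    kept = (pair for pair in scored if pair[1] >= min_score)
    return heapq.nlargest(n_best, kept, key=lambda pair: pair[1])


drugs = ['amoxil', 'acyclovir', 'univir']
for text in ['amoxil', 'univer']:
    print(text, best_matches(text, drugs, min_score=0.9, n_best=2))
```

With 10 million drugs and 100 texts, storing every score means holding a billion numbers, while this pattern holds at most `n_best` pairs per text. For real workloads the compiled library above should be far faster than this pure-Python loop.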