
How to check whether strings in two lists are almost equal using Python


I'm trying to find the strings in one list that almost match a string in another list. Suppose the two lists are as follows:

string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

Expected output
similar = ['apple_from_2018','samsung_from_2017','htc_from_2015','lenovo_decommision_2017']
not_similar = ['nokia_from_2010','moto_from_2019']

I tried the implementation below, but it does not give the expected result:

from difflib import SequenceMatcher

similar = []
not_similar = []
for item1 in string_list_1:
    for item2 in string_list_2:
        if SequenceMatcher(a=item1, b=item2).ratio() > 0.90:
            similar.append(item1)
        else:
            not_similar.append(item1)
  

With this implementation the output is not what I expect. I would appreciate it if someone could identify what is missing and show how to get the required result.


Solution

  • You can use the following function to measure the similarity between two strings. It returns a ratio between 0 and 1, where 1 means the strings are identical.

    from difflib import SequenceMatcher
    
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
    
    
    print(similar("apple_from_2018", "apple_from_2020"))
    

    Output :

    0.8666666666666667
    

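    Using this function, you can select the strings whose similarity ratio exceeds a threshold. Note that the pair above scores about 0.867, which is below your threshold of 0.90, so 'apple_from_2018' is never counted as similar. To get the expected output, lower the threshold to 0.85 or less.

    Your loop has a second problem: the else branch runs for every item2 that does not match, so each item1 is appended to not_similar several times, even when it matches another string in string_list_2. Decide only after the inner loop has finished whether a match was found.

    The following code handles both issues: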

    from difflib import SequenceMatcher

    string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

    string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

    similar = []
    not_similar = []
    for item1 in string_list_1:
        # Assume no match until one is found
        found = False
        for item2 in string_list_2:
            if SequenceMatcher(None, item1, item2).ratio() > 0.80:
                similar.append(item1)
                found = True
                # Stop at the first match so item1 is appended only once
                break

        # Reached only after every item2 has been checked (or a match was found)
        if not found:
            not_similar.append(item1)

    print("Similar : ", similar)
    print("Not Similar : ", not_similar)
    

    Output :

    Similar :  ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015', 'lenovo_decommision_2017']
    Not Similar :  ['nokia_from_2010', 'moto_from_2019']
    
    

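    The break stops the inner loop at the first match, which avoids unnecessary comparisons, and the found flag ensures each string is appended to exactly one list, exactly once. I have also lowered the similarity threshold to 0.80, since 0.90 was too high for this data. Feel free to tweak the value.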
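
If it is unclear which threshold suits your data, it can help to look at the best score each string achieves before picking one. The following is a minimal sketch of that idea, not part of the answer above: the helper names best_match and partition_by_similarity are hypothetical, and the 0.80 default is only the value used in this thread.

```python
from difflib import SequenceMatcher


def best_match(item, candidates):
    """Return (candidate, ratio) for the candidate most similar to item.

    Returns (None, 0.0) when candidates is empty.
    """
    scored = ((c, SequenceMatcher(None, item, c).ratio()) for c in candidates)
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


def partition_by_similarity(items, candidates, threshold=0.80):
    """Split items into (similar, not_similar) using each item's best ratio."""
    similar, not_similar = [], []
    for item in items:
        _, score = best_match(item, candidates)
        if score > threshold:
            similar.append(item)
        else:
            not_similar.append(item)
    return similar, not_similar


string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

# Print the best score per item to see where a sensible threshold lies.
for item in string_list_1:
    candidate, score = best_match(item, string_list_2)
    print(f"{item:<26} {candidate!s:<26} {score:.3f}")

similar, not_similar = partition_by_similarity(string_list_1, string_list_2)
print("Similar : ", similar)
print("Not Similar : ", not_similar)
```

Unlike the loop with break, this version compares every pair, so it is somewhat slower, but it shows the actual scores. If you only need the matching strings and not the scores, the standard library's difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6) may also be worth a look.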