Search code examples
pythonstringipdifflib

How does the python difflib.get_close_matches() function work?


The following are two arrays:

import difflib
import scipy
import numpy

a1=numpy.array(['198.129.254.73','134.55.221.58','134.55.219.121','134.55.41.41','198.124.252.101'], dtype='|S15')
b1=numpy.array(['198.124.252.102','134.55.41.41','134.55.219.121','134.55.219.137','134.55.220.45', '198.124.252.130'],dtype='|S15')

difflib.get_close_matches(a1[-1],b1,2)

output:

['198.124.252.130', '198.124.252.102']

shouldnt '198.124.252.102' be the closest match for '198.124.252.101'?

I looked at the documentation where they have specified about some floating type weights but no information on algorithm use.

I am in need to find if the absolute difference between the last two octet is 1 (provided the first three octets are same).

So I am finding the closest string first and then checking that closest string for the above condition.

Is there any other function or way to achieve this? Also how does get_close_matches() behave?

ipaddr doesnt seem to have such a manipulation for ips.


Solution

  • Well, there is this part in the docs explaining your issue:

    This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

    For getting the results you are expecting you could use the Levenshtein_distance.

    But for comparing IPs I would suggest to use integer comparison:

    >>> parts = [int(s) for s in '198.124.252.130'.split('.')]
    >>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
    >>> from operator import sub
    >>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
    >>> diff
    29
    

    You can use this style to create a compare function:

    from functools import partial
    from operator import sub
    
    def compare_ips(base, ip1, ip2):
        base = [int(s) for s in base.split('.')]
        parts1 = (int(s) for s in ip1.split('.'))
        parts2 = (int(s) for s in ip2.split('.'))
        test1 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts1)))
        test2 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts2)))
        return cmp(test1, test2)
    
    base = '198.124.252.101'
    test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
                 '134.55.219.137','134.55.220.45', '198.124.252.130']
    sorted(test_list, cmp=partial(compare_ips, base))
    # yields:
    # ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137', 
    #  '134.55.220.45', '134.55.41.41']