Search code examples
pythonnlpnamed-entity-recognitionfuzzy-searchfuzzywuzzy

Efficient way to find an approximate string match and replacing with predefined string


I need to build a NER system (Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz. But unfortunately I didn't find a way to return the position where the match occurs. As, for my purpose I not only need to find the match, but also I need to know where the match happened. As for NER, I need to replace those matches with some predefined string.

For example, If any one of the line is found in input string I want to replace them with the string COMPANY_NAME:

google
microsoft
facebook
International Business Machine

Like, input: S/he works at Google will be transformed to S/he works at COMPANY_NAME. You can safely assume that, all the input and the pattern to match are already preprocessed and most importantly they are in lower-case now. So, there is no problem with case-sensitivity.

Currently, I have approached with a sliding window technique. And a sliding window is passed over the input string from left to right and this window has exactly the size of the pattern we want to match. For example, when I want to match with International Business Machine, I run a sliding window of size 3 from left to right and try to find the best match by observing each 3 consecutive tokens at the same time with a stride of 1. I do believe, it is not the best way to do it, also it cannot find the best match.

So, what is the efficient way to find the best possible match along with the quantification on the found match (how much they are similar) and the position of the match(es), such that we can replace them with a given fixed string (if the calculated similarity is not less than a threshold)? Obviously, a single input may contain multiple portions to be replaced, each of them will be replaced separately, like: Google and Microsoft are big companies will become COMPANY_NAME and COMPANY_NAME are big companies etc.

EDIT: fixed link to RapidFuzz


Solution

  • It seems modules fuzzywuzzy and RapidFuzz don't have function for this. You could try to use process.extract() or process.extractOne() but it would need to split text in smaller parts (ie. words) and check every part separatelly. For longer words like International Business Machine it would need to split in part with 3 words - so it would need even more work.


    I think you need rather module fuzzysearch

    import fuzzysearch
    
    words = ['google', 'microsoft', 'facebook', 'International Business Machine']
    
    text = 'Google and Microsoft are big companies like International Business Machine'
    
    print(' text:', text)
    print('---')
        
    for word in sorted(words, key=len, reverse=True):
        print(' word:', word)
        
        results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
        print('found:', results)
        
        for item in reversed(results):
            text = text[:item.start] + 'COMPANY' + text[item.end:]
        print(' text:', text)
        
        print('---')
    

    Result:

     text: Google and Microsoft are big companies like facebook International Business Machine
    ---
     word: International Business Machine
    found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
     text: Google and Microsoft are big companies like facebook COMPANY
    ---
     word: microsoft
    found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
     text: Google and COMPANY are big companies like facebook COMPANY
    ---
     word: facebook
    found: [Match(start=42, end=50, dist=0, matched='facebook')]
     text: Google and COMPANY are big companies like COMPANY COMPANY
    ---
     word: google
    found: [Match(start=0, end=6, dist=1, matched='Google')]
     text: COMPANY and COMPANY are big companies like COMPANY COMPANY
    

    If it finds many results for one word then it is better to start replacing at last position to keep other words in the same place. And this is why I use reversed().

    I would start also with the longest word/name so later it still can search shorter words like Business. And this is why I use sorted(..., key=len, reverse=True)

    But I'm not sure if it works exactly as you want. Maybe it will have problem when words are more incorrect.


    EDIT:

    I tried to use fuzzywuzzy for this and created this version but only for names with single word. For International Business Machine it would need some other idea.

    It split full text into words and compare words. Later replace word wich have ration > 80

    words = ['google', 'microsoft', 'facebook', 'International Business Machine']
    
    text = 'Google and Microsoft are big companies like International Business Machine'
    
    # ---
    
    import fuzzywuzzy.fuzz as fuzz
    #import fuzzywuzzy.process
    
    new_words = []
    
    for part in text.split():
    
        matches = []
    
        for word in words:
            result = fuzz.token_sort_ratio(part, word)
            matches.append([result, part, word])
            #print([result, part, word])
    
        matches = sorted(matches, reverse=True)
    
        if matches and matches[0][0] > 80:
            new_words.append('COMPANY')
        else:
            new_words.append(matches[0][1])
            
    print(" ".join(new_words))
    

    Result:

    [100, 'Google', 'google']
    [27, 'Google', 'microsoft']
    [29, 'Google', 'facebook']
    [17, 'Google', 'International Business Machine']
    [0, 'and', 'google']
    [0, 'and', 'microsoft']
    [18, 'and', 'facebook']
    [12, 'and', 'International Business Machine']
    [27, 'Microsoft', 'google']
    [100, 'Microsoft', 'microsoft']
    [35, 'Microsoft', 'facebook']
    [15, 'Microsoft', 'International Business Machine']
    [22, 'are', 'google']
    [17, 'are', 'microsoft']
    [36, 'are', 'facebook']
    [12, 'are', 'International Business Machine']
    [22, 'big', 'google']
    [17, 'big', 'microsoft']
    [18, 'big', 'facebook']
    [12, 'big', 'International Business Machine']
    [27, 'companies', 'google']
    [33, 'companies', 'microsoft']
    [24, 'companies', 'facebook']
    [26, 'companies', 'International Business Machine']
    [40, 'like', 'google']
    [15, 'like', 'microsoft']
    [17, 'like', 'facebook']
    [18, 'like', 'International Business Machine']
    [21, 'International', 'google']
    [27, 'International', 'microsoft']
    [19, 'International', 'facebook']
    [60, 'International', 'International Business Machine']
    [14, 'Business', 'google']
    [24, 'Business', 'microsoft']
    [12, 'Business', 'facebook']
    [42, 'Business', 'International Business Machine']
    [15, 'Machine', 'google']
    [25, 'Machine', 'microsoft']
    [40, 'Machine', 'facebook']
    [38, 'Machine', 'International Business Machine']
    COMPANY and COMPANY are big companies like International Business Machine
    

    EDIT:

    Second version which check also names with many words

    all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
    
    text = 'Google and Microsoft are big companies like International Business Machine'
    
    # ---
    
    import fuzzywuzzy.fuzz as fuzz
    
    
    for name in all_names:
    
        length = len(name.split(' ')) # how many words has name 
        print('name length:', length, '|', name)
    
        words = text.split()  # split text into words
    
        # compare name with all words in text
        
        matches = []
        
        for index in range(0, len(words)-length+1):
            # join words if name has more then 1 word
            part = " ".join(words[index:index+length])
            #print('part:', part)
            
            result = fuzz.token_sort_ratio(part, name)
            matches.append([result, name, part, [index, index+length]])
    
            print([result, name, part, [index, index+length]])
            
        # reverse to start at last position
        matches = list(reversed(matches))
    
        max_match = max(x[0] for x in matches)
        print('max match:', max_match)
    
        # replace
        if max_match > 80:
            for match in matches:
                if  match[0] == max_match:
                    idx = match[3]  
                    words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]
    
        text = " ".join(words)
        print('text:', text)
        print('---')
    

    Result:

    ame length: 1 | google
    [100, 'google', 'Google', [0, 1]]
    [0, 'google', 'and', [1, 2]]
    [27, 'google', 'Microsoft', [2, 3]]
    [22, 'google', 'are', [3, 4]]
    [22, 'google', 'big', [4, 5]]
    [27, 'google', 'companies', [5, 6]]
    [40, 'google', 'like', [6, 7]]
    [21, 'google', 'International', [7, 8]]
    [14, 'google', 'Business', [8, 9]]
    [15, 'google', 'Machine', [9, 10]]
    max match: 100
    text: COMPANY and Microsoft are big companies like International Business Machine
    ---
    name length: 1 | microsoft
    [25, 'microsoft', 'COMPANY', [0, 1]]
    [0, 'microsoft', 'and', [1, 2]]
    [100, 'microsoft', 'Microsoft', [2, 3]]
    [17, 'microsoft', 'are', [3, 4]]
    [17, 'microsoft', 'big', [4, 5]]
    [33, 'microsoft', 'companies', [5, 6]]
    [15, 'microsoft', 'like', [6, 7]]
    [27, 'microsoft', 'International', [7, 8]]
    [24, 'microsoft', 'Business', [8, 9]]
    [25, 'microsoft', 'Machine', [9, 10]]
    max match: 100
    text: COMPANY and COMPANY are big companies like International Business Machine
    ---
    name length: 1 | facebook
    [27, 'facebook', 'COMPANY', [0, 1]]
    [18, 'facebook', 'and', [1, 2]]
    [27, 'facebook', 'COMPANY', [2, 3]]
    [36, 'facebook', 'are', [3, 4]]
    [18, 'facebook', 'big', [4, 5]]
    [24, 'facebook', 'companies', [5, 6]]
    [17, 'facebook', 'like', [6, 7]]
    [19, 'facebook', 'International', [7, 8]]
    [12, 'facebook', 'Business', [8, 9]]
    [40, 'facebook', 'Machine', [9, 10]]
    max match: 40
    text: COMPANY and COMPANY are big companies like International Business Machine
    ---
    name length: 3 | International Business Machine
    [33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
    [31, 'International Business Machine', 'and COMPANY are', [1, 4]]
    [31, 'International Business Machine', 'COMPANY are big', [2, 5]]
    [34, 'International Business Machine', 'are big companies', [3, 6]]
    [38, 'International Business Machine', 'big companies like', [4, 7]]
    [69, 'International Business Machine', 'companies like International', [5, 8]]
    [88, 'International Business Machine', 'like International Business', [6, 9]]
    [100, 'International Business Machine', 'International Business Machine', [7, 10]]
    max match: 100
    text: COMPANY and COMPANY are big companies like COMPANY
    

    EDIT:

    Version with fuzzywuzzy.process

    This time I don't have positions and I simply use standard text.replace(item[0], 'COMPANY').

    I think in most situations it will work correctly and it doesn't need better method.

    This time I check it on text with mistakes:

    'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
    
    
    all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
    
    text = 'Google and Microsoft are big companies like Facebook and International Business Machine'
    
    # text with mistakes
    text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
    
    # ---
    
    import fuzzywuzzy.process
    #import fuzzywuzzy.fuzz
    
    for name in sorted(all_names, key=len, reverse=True):
        lenght = len(name.split())
    
        words = text.split()
        words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
        #print(words)
    
        #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
        result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
        print(name, result)
    
        for item in result:
            text = text.replace(item[0], 'COMPANY')
    
    print(text)