Tags: python, nlp, data-analysis, data-cleaning, difflib

Python: difflib.get_close_matches comparing modified text but returning original


I extracted a list of words from a text, but during preprocessing I lowercased everything for easier comparison.

My question is: how can I make the extracted words in the list appear exactly as they appeared in the original text?

I tried first tokenizing the original text and then finding, in this tokenized list, the closest matches to the words I had extracted from the text. I used each of the following to find the closest matches:

  1. nltk.edit_distance
  2. difflib.get_close_matches

But neither of them worked as I wanted. They return somewhat similar words, but not the words exactly as they appear in the original text. I think the problem is that these methods treat lowercase and uppercase characters as different.

The extracted words can be anything from unigrams up to 5-grams.

Example:

I extracted the bigram [rfid alert] from the text, but in the original text it appeared as [RFID alert].

After using

difflib.get_close_matches('rfid alert', original_text_unigram_tokens_list)

its output was [profile Caller] and not [RFID alert]. That is because string comparison in Python is case-sensitive. I think it found that the entry in original_text_unigram_tokens_list with the fewest differing characters from [rfid alert] is [profile Caller], so it returned [profile Caller].
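The effect described above can be reproduced with a small sketch. It assumes a two-item candidate list taken from this example (not the real token list) and uses difflib.SequenceMatcher, which get_close_matches relies on internally, to print the similarity scores with and without lowercasing. The helper name similarity is made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

query = "rfid alert"
# Assumed candidates, taken from the example in this question
candidates = ["profile Caller", "RFID alert"]

# Case-sensitive scores: "RFID" and "rfid" share no characters,
# so only " alert" counts towards the match.
case_sensitive = {c: similarity(c, query) for c in candidates}

# Case-insensitive scores: compare lowercased copies,
# but keep the original spelling as the dictionary key.
case_insensitive = {c: similarity(c.lower(), query.lower()) for c in candidates}

for c in candidates:
    print(f"{c!r}: case-sensitive={case_sensitive[c]:.3f}, "
          f"case-insensitive={case_insensitive[c]:.3f}")
```

With case-sensitive comparison, [profile Caller] scores higher than [RFID alert]; once both sides are lowercased, [RFID alert] becomes a perfect match.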

Therefore my question is: is there a ready-made method or a workaround that returns the n-gram exactly as it appeared in the original text? For instance, in the example above I want to get [RFID alert] instead of [profile Caller].

I appreciate any help. Thank you in advance.


Solution

  • Similarly to this question, you can take the source code of difflib.get_close_matches and adapt it to your needs.

    Modifications I made:

    • The default cutoff is raised to 0.99 (in theory it could even be 1.0, but I pass a slightly smaller number so that floating-point rounding cannot influence the result).

    • s.set_seq1(x.lower()), so that the comparison is done between lowercased strings while the original x is returned.

    Full code of the modified function:

    from difflib import SequenceMatcher
    from heapq import nlargest  # public function behind difflib's private _nlargest alias
    
    def get_close_matches_lower(word, possibilities, n=3, cutoff=0.99):
        """Like difflib.get_close_matches, but compares case-insensitively
        and returns each match in its original casing."""
        if not n > 0:
            raise ValueError("n must be > 0: %r" % (n,))
        if not 0.0 <= cutoff <= 1.0:
            raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
        result = []
        s = SequenceMatcher()
        s.set_seq2(word.lower())  # lower-case the query too, in case it is not already
        for x in possibilities:
            s.set_seq1(x.lower())  # lower-case for comparison only
            # Cheap upper bounds are checked first; the exact ratio() comes last
            if s.real_quick_ratio() >= cutoff and \
               s.quick_ratio() >= cutoff and \
               s.ratio() >= cutoff:
                result.append((s.ratio(), x))  # store the original x
    
        # Move the best scorers to head of list
        result = nlargest(n, result)
        # Strip scores for the best n matches
        return [x for score, x in result]
    

    And the result on the example you gave:

    print(get_close_matches_lower('rfid alert', ['profile Caller','RFID alert']))
    

    Printing:

    ['RFID alert']
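If only exact matches up to letter case are wanted (which is what a cutoff of 0.99 effectively asks for), a plain dictionary lookup may be a simpler alternative to fuzzy matching. The following is a minimal sketch of that different technique, not part of the function above. It assumes the original text is available as a list of tokens and that n-grams are joined with a single space; build_ngram_lookup and the sample tokens are hypothetical names and data for illustration.

```python
def build_ngram_lookup(tokens, max_n=5):
    """Map each lowercased n-gram (1..max_n tokens) to its original spelling."""
    lookup = {}
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            original = " ".join(tokens[start:start + n])
            # setdefault keeps the first spelling if the same n-gram recurs
            lookup.setdefault(original.lower(), original)
    return lookup

# Hypothetical tokenized original text, for illustration only
original_tokens = ["The", "profile", "Caller", "raised", "an", "RFID", "alert"]
lookup = build_ngram_lookup(original_tokens)

# Lowercased n-grams as produced by the preprocessing step
extracted = ["rfid alert", "profile caller", "missing phrase"]

# Fall back to the lowercased form when no original spelling is found
restored = [lookup.get(ngram, ngram) for ngram in extracted]
print(restored)
```

This avoids scoring every candidate against every query, at the cost of not tolerating any difference other than case (for example, punctuation or spacing changed during preprocessing).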