python nlp data-analysis data-cleaning difflib

Python: difflib.get_close_matches comparing modified text but returning original

I extracted a list of words from a text, but during preprocessing I lowercased everything for easier comparison.

My question is: how can I make the extracted words in the list appear exactly as they appeared in the original text?

I first tokenized the original text and then looked in this tokenized list for the closest match to each entry in the word list I extracted. I tried each of the following to find the closest matches:

nltk.edit_distance
difflib.get_close_matches

But neither of them worked as I wanted. They return somewhat similar words, but not the words exactly as they appear in the original text. I think the problem is that these methods treat lowercase and uppercase characters as different.

The extracted terms can be anything from unigrams up to 5-grams.

Example:

I extracted the bigram [rfid alert] from the text, but in the original text it appeared as [RFID alert].

After using

difflib.get_close_matches('rfid alert', original_text_unigram_tokens_list)

its output was [profile Caller] and not [RFID alert]. That is because the comparison is case-sensitive. I think it found that the entry in original_text_unigram_tokens_list with the fewest characters differing from [rfid alert] is [profile Caller], so it returned [profile Caller].

Therefore my question is: is there a ready-made method or a workaround that returns the n-gram in its original form, exactly as it appeared in the text? For instance, I want to get [RFID alert] instead of [profile Caller] in the above example, and so on.

I appreciate any help. Thank you in advance.

Solution

Similarly to this question, you can take the source code of difflib.get_close_matches and adapt it to your needs.

Modifications I made:

cutoff default value raised to 0.99 (in theory it could even be 1.0, but I pass a slightly smaller number so that floating-point error cannot affect the result).
s.set_seq1(x.lower()), so that the comparison is done between lowercased strings (while the original x is what gets returned).
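
To see why these two changes matter, here is a small check using the two candidate strings from the question. It is only a sketch: the helper name score is made up for illustration. It compares SequenceMatcher.ratio() with and without lowercasing the candidate. Only the lowercased comparison can reach 1.0 for a candidate that differs from the query in case alone, which is what a cutoff of 0.99 relies on.

```python
from difflib import SequenceMatcher

word = 'rfid alert'
candidates = ['profile Caller', 'RFID alert']

def score(candidate, query, lowercase):
    """Similarity ratio with seq1 = candidate and seq2 = query, as in get_close_matches."""
    a = candidate.lower() if lowercase else candidate
    return SequenceMatcher(None, a, query).ratio()

# Ratios for the plain (case-sensitive) and the lowercased comparison
case_sensitive = {c: score(c, word, lowercase=False) for c in candidates}
case_insensitive = {c: score(c, word, lowercase=True) for c in candidates}

for c in candidates:
    print(c, round(case_sensitive[c], 3), round(case_insensitive[c], 3))
```

With the lowercased comparison, 'RFID alert' scores exactly 1.0 and passes the cutoff, while 'profile Caller' stays well below it.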

Full code of the modified function:

from difflib import SequenceMatcher
from heapq import nlargest  # public function that difflib imports under the private alias _nlargest

def get_close_matches_lower(word, possibilities, n=3, cutoff=0.99):
    if not n >  0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)  # word is expected to be lower-cased already
    for x in possibilities:
        s.set_seq1(x.lower())  # lower-case for comparison
        # Cheap upper bounds first, the exact ratio only if they pass
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))  # keep the original-cased x

    # Move the best scorers to head of list
    result = nlargest(n, result)
    # Strip scores for the best n matches
    return [x for score, x in result]

And the result on the example you gave:

print(get_close_matches_lower('rfid alert', ['profile Caller','RFID alert']))

Printing:

['RFID alert']
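
Since a cutoff this close to 1.0 effectively asks for an exact case-insensitive match, a plain dictionary lookup may be enough if you do not need any fuzziness. The following is a minimal sketch under some assumptions: tokens are split on whitespace, n-grams are joined with a single space, and the helper name build_ngram_lookup and the sample sentence are made up for illustration.

```python
def build_ngram_lookup(tokens, max_n=5):
    """Map each lowercased n-gram (n = 1..max_n) to its original-cased form."""
    lookup = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            original = ' '.join(tokens[i:i + n])
            # setdefault keeps the first occurrence if an n-gram appears in several casings
            lookup.setdefault(original.lower(), original)
    return lookup

# Made-up sample text; a real pipeline would reuse its own tokenizer here.
original_text = 'The profile Caller raised an RFID alert yesterday'
lookup = build_ngram_lookup(original_text.split())

extracted = ['rfid alert', 'profile caller', 'missing term']
# Fall back to the lowercased n-gram when it is not found in the original text
restored = [lookup.get(ngram, ngram) for ngram in extracted]
print(restored)
```

For this sample the lookup restores 'RFID alert' and 'profile Caller' and leaves 'missing term' unchanged. It only works when the preprocessing changed nothing but the case; if it also stripped punctuation or stemmed words, the fuzzy version above with a lower cutoff is the better fit.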