I extracted a list of words from a text, but during preprocessing I lowercased everything for easier comparison.
My question is: how can I make the extracted words in the list appear exactly as they appeared in the original text?
I tried first tokenizing the original text and then finding the closest matches in this tokenized list for each entry in the word list I extracted from the text. I tried each of the following to find the closest matches:
But neither of them worked as I wanted. They return somewhat similar words, but not the words exactly as they appear in the original text. I think the problem is that these methods treat lowercase and uppercase words differently.
The extracted words can be anything from unigrams and bigrams up to 5-grams.
Example:
I extracted the bigram [rfid alert] from the text, but in the original text it appeared as [RFID alert].
After using
difflib.get_close_matches('rfid alert', original_text_unigram_tokens_list)
its output was [profile Caller] and not [RFID alert]. That is because string comparison in Python is case-sensitive. I think it found that the entry in original_text_unigram_tokens_list with the fewest differing characters from [rfid alert] is [profile Caller], so it returned [profile Caller].
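To illustrate the case-sensitivity issue, here is a minimal sketch that compares similarity scores directly with difflib.SequenceMatcher, the class that get_close_matches uses internally. The candidates list and the score helper are made-up stand-ins for illustration, not the real token list.

```python
from difflib import SequenceMatcher

word = "rfid alert"
# Hypothetical stand-in for the n-grams taken from the original text.
candidates = ["profile Caller", "RFID alert"]


def score(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Scores as get_close_matches sees them: 'R' and 'r' count as different characters.
case_sensitive = {c: score(c, word) for c in candidates}
# Scores when each candidate is lower-cased only for the comparison.
case_insensitive = {c: score(c.lower(), word) for c in candidates}

for c in candidates:
    print(c, round(case_sensitive[c], 3), round(case_insensitive[c], 3))
```

In the case-sensitive comparison only " alert" matches in [RFID alert], so it is not a perfect match; once the candidate is lower-cased for the comparison, it matches [rfid alert] exactly.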
Therefore my question is: is there any ready-made method or workaround that returns the n-gram in its original form, exactly as it appeared in the text? For instance, I want to get [RFID alert] instead of [profile Caller] in the above example, and so on.
I appreciate any help. Thank you in advance.
Similarly to this question, you can take the source code of difflib.get_close_matches and adapt it to your needs.
Modifications I made:
- cutoff default value raised to 0.99 (theoretically it could even be 1.0, but to make sure floating-point errors do not influence the results I pass a slightly smaller number).
- s.set_seq1(x.lower()), so that the comparison is done between lower-cased strings (while the original x is still what gets returned).
Full code of the modified function:
# necessary imports of functions used by the modified get_close_matches
from difflib import SequenceMatcher, _nlargest


def get_close_matches_lower(word, possibilities, n=3, cutoff=0.99):
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x.lower())  # lower-case for comparison
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))

    # Move the best scorers to head of list
    result = _nlargest(n, result)
    # Strip scores for the best n matches
    return [x for score, x in result]
And the result on the example you gave:
print(get_close_matches_lower('rfid alert', ['profile Caller','RFID alert']))
Printing:
['RFID alert']
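Since a cutoff of 0.99 effectively asks for an exact case-insensitive match, a plain dictionary lookup may be a simpler alternative. The following is only a sketch under assumptions: build_ngram_lookup is a hypothetical helper, the sample sentence is invented, and tokens are obtained with a simple whitespace split, which may differ from your tokenizer.

```python
def build_ngram_lookup(tokens, max_n=5):
    """Map each lower-cased n-gram (n = 1..max_n) to its original spelling."""
    lookup = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            original = " ".join(tokens[i:i + n])
            # Keep the first spelling seen if the same n-gram occurs more than once.
            lookup.setdefault(original.lower(), original)
    return lookup


# Invented sample text; whitespace tokenization is an assumption.
original_text = "The profile Caller raised an RFID alert yesterday"
tokens = original_text.split()
lookup = build_ngram_lookup(tokens)

print(lookup.get("rfid alert"))
```

This recovers only n-grams that match the original text exactly apart from case; if the preprocessing changed more than case (for example stripped punctuation), the fuzzy matching above is the better fit.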