I'm working with a dictionary where keys are strings (often multiple words) and values are integers. Here's a simplified representation of the dictionary:
{
    ...
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
    ...
}
Problem Statement: I need to iterate over the keys in this dictionary and match them against a series of tokens extracted from a text. For example, consider the following text:
MR imaging shows that the patient suffers from infection in right lower kidney.
I split this text into tokens using whitespace. The goal is to match entire key phrases (like "MR imaging" and "right lower kidney") from the dictionary with these tokens.
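For reference, the tokenization is just a plain whitespace split:

text = 'MR imaging shows that the patient suffers from infection in right lower kidney.'
tokens = text.split()  # whitespace split; punctuation stays attached, e.g. 'kidney.'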
Current Approach: So far, I've written code that can successfully identify "MR imaging" but fails to recognize "right lower kidney." Here's my current code:
from copy import deepcopy

found = []
for i, t in enumerate(tokens):
    term = [tokens[i]]
    j = deepcopy(i)
    # Extend the phrase only while the joined term is itself a dictionary key.
    while ' '.join(term) in self.db_terms:
        if j < len(tokens):
            j += 1
            term.append(tokens[j])
    found.append(' '.join(term[:-1]))  # drop the token that broke the match
return set(found)
The issue arises because my loop only extends a candidate phrase while every prefix is itself a key: since neither "right" nor "right lower" is in the dictionary, the code never reaches the complete phrase "right lower kidney".
I'm looking for suggestions on how to modify my approach to capture multi-word keys like "right lower kidney". Any advice on optimizing the code to handle such cases effectively would be greatly appreciated. Thank you in advance!
It seems like you are dealing with N-grams. Note that this answer assumes your dictionary contains many keys relative to the number of possible N-grams in the text; in that case, it is more efficient to generate N-grams from the text than to iterate over the dictionary keys (as the other answer does).
Start by defining the keys dictionary.
keys = {
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
}
You will need to generate all N-grams within a range of your choosing and, for each N-gram, determine whether it exists as a key in your dictionary.
import re

def get_ngrams(tokens, ngram_range):
    # Build every N-gram in the half-open range as a space-joined string.
    return {' '.join(tokens[i:i+r])
            for i in range(len(tokens)) for r in range(*ngram_range)}

ngram_range = (1, 4)  # Right exclusive: unigrams, bigrams, and trigrams.
text = 'MR imaging shows that the patient suffers from infection in right lower kidney.'
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()  # Strip punctuation, then split.
found_tokens = set(filter(keys.__contains__, get_ngrams(tokens, ngram_range)))
print(found_tokens)
# {'MR imaging', 'right lower kidney'}
Keep in mind that for larger ranges and longer strings, this becomes an expensive operation.
You can optimise a bit by recognising that not all N-grams need to be stored in memory before filtering. A generator and a plain loop let you test each N-gram as it is produced:
def ngrams_generator(tokens, ngram_range):
    # Same N-grams as above, but yielded lazily one at a time.
    yield from (' '.join(tokens[i:i+r])
                for i in range(len(tokens)) for r in range(*ngram_range))

found_ngrams = set()
for ngram in ngrams_generator(tokens, ngram_range):
    if ngram in keys:
        found_ngrams.add(ngram)
print(found_ngrams)
# {'MR imaging', 'right lower kidney'}
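If you do not want to hard-code the range, one further refinement (a sketch, not part of the code above) is to derive the upper bound from the longest key, so you never generate N-grams longer than anything you could possibly match. This assumes all keys are space-separated phrases like the ones shown:

# Sketch: derive the N-gram range from the keys themselves.
max_n = max(len(key.split()) for key in keys)  # longest key here is 3 words
ngram_range = (1, max_n + 1)                   # right exclusive, as before

found_ngrams = {ngram for ngram in ngrams_generator(tokens, ngram_range)
                if ngram in keys}
print(found_ngrams)
# {'MR imaging', 'right lower kidney'}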