Search code examples
pythonregexstringstring-matching

Matching incomplete string in Python dictionary


I'm working with a dictionary where keys are strings (often multiple words) and values are integers. Here's a simplified representation of the dictionary:

{
...
'X ontology entity': 0, 
'X entity': 1, 
'image quality': 10, 
'right lower kidney': 10, 
'magnetic resonance imaging': 10312, 
'MR imaging': 10312, 
 ...
}

Problem Statement: I need to iterate over the keys in this dictionary and match them against a series of tokens extracted from a text. For example, consider the following text:

MR imaging shows that the patient suffers from infection in right lower kidney.

I split this text into tokens using whitespace. The goal is to match entire key phrases (like "MR imaging" and "right lower kidney") from the dictionary with these tokens.

Current Approach: So far, I've written code that can successfully identify "MR imaging" but fails to recognize "right lower kidney." Here's my current code:

found = []
for i, t in enumerate(tokens):
    term = [tokens[i]]
    j = deepcopy(i)
    while (' '.join(term) in self.db_terms):
        if j < len(tokens):
            j += 1
            term.append(tokens[j])
    found.append(' '.join(term[:-1]))
return set(found)

The issue arises because "right lower" is not a key in the dictionary, so my code misses the complete phrase "right lower kidney".

I'm looking for suggestions on how to modify my approach to capture multi-word keys like "right lower kidney". Any advice on optimizing the code to handle such cases effectively would be greatly appreciated. Thank you in advance!


Solution

  • It seems like you are dealing with Ngrams. Note, this answer assumes there are many keys in your dictionary as opposed to possible N-grams. In this case, it is more efficient to generate n-grams from the text as opposed to iterating over the dictionary keys (as is the case with the other answer).

    Start with defining the keys dictionary.

    keys = {
    'X ontology entity': 0, 
    'X entity': 1, 
    'image quality': 10, 
    'right lower kidney': 10, 
    'magnetic resonance imaging': 10312, 
    'MR imaging': 10312, 
    }
    

    You will need to generate all N-grams within a range (that you decide), and for each n-gram, determine whether it exists as a key in your dictionary.

    import re
    
    def get_ngrams(tokens, ngram_range):
        return {' '.join(tokens[i:i+r]) 
            for i in range(len(tokens)) for r in range(*ngram_range)}
    
    ngram_range = (1, 4) # Right exclusive.
    tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()
    found_tokens = set(filter(keys.__contains__, get_ngrams(tokens, ngram_range)))
    
    print(found_tokens)
    # {'MR imaging', 'right lower kidney'}
    

    Keep in mind, for larger ranges and strings, this becomes an expensive operation.


    You can optimise a bit by recognising that not all N-grams need to be stored in memory before filtering. We can save big time using a generator and loop:

    def ngrams_generator(tokens, ngram_range):
        yield from (' '.join(tokens[i:i+r]) 
            for i in range(len(tokens)) for r in range(*ngram_range))
    
    found_ngrams = set()
    for ngram in ngrams_generator(tokens, ngram_range):
        if ngram in keys:
            found_ngrams.add(ngram)
    
    print(found_ngrams)
    # {'MR imaging', 'right lower kidney'}