Tags: python, string, dictionary, nlp, normalization

How to replace compound words in a string using a dictionary?


I have a dictionary whose key:value pairs map compound words to the expressions I want to replace them with in a text. For example:

terms_dict = {'digi conso': 'digi conso', 'digi': 'digi conso', 'digiconso': 'digi conso', '3xcb': '3xcb', '3x cb': '3xcb', 'legal entity identifier': 'legal entity identifier'}

My goal is to create a function replace_terms(text, dict) that takes a text and a dictionary like this one as parameters, and returns the text after replacing the compound words.

For instance, this script:

test_text = "i want a digi conso loan for digiconso" 

print(replace_terms(test_text, terms_dict))

Should return:

"i want a digi conso loan for digi conso"

I have tried using .replace(), but for some reason it doesn't work properly, probably because the terms to replace are composed of multiple words.

I also tried this:

import re

def replace_terms(text, terms_dict):
    if len(terms_dict) > 0:
        words_in = [k for k in terms_dict.keys() if k in text]  # ex: words_in = [digi conso, digi, digiconso]
        if len(words_in) > 0:
            for w in words_in:
                pattern = r"\b" + w + r"\b"
                text = re.sub(pattern, terms_dict[w], text)

    return text

But when applied to my text, this function returns "i want a digi conso conso loan for digi conso": the word conso gets doubled. I can see why: the words_in list is built once from the dictionary keys against the original text and is not updated as the text changes, so after 'digi conso' has been handled, the key 'digi' still matches inside it and is replaced by 'digi conso' again.
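To make the doubling concrete, here is a minimal trace of the same loop that records the text after each key is applied. The helper name trace_replacements is made up for this illustration, and re.escape is added only as a precaution; the behavior is otherwise the same as the function above.

```python
import re

terms_dict = {
    'digi conso': 'digi conso',
    'digi': 'digi conso',
    'digiconso': 'digi conso',
}
text = "i want a digi conso loan for digiconso"


def trace_replacements(text, terms_dict):
    """Apply each matching key in turn and record every intermediate text."""
    steps = []
    # The list of keys is computed once, against the original text.
    for key in [k for k in terms_dict if k in text]:
        text = re.sub(r"\b" + re.escape(key) + r"\b", terms_dict[key], text)
        steps.append((key, text))
    return steps


for key, result in trace_replacements(text, terms_dict):
    print(f"{key!r} -> {result}")
```

The second step is where things go wrong: 'digi' matches the first word of the already correct 'digi conso' and expands it to 'digi conso conso'.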

Is there an efficient way to do this?

Thanks a lot!


Solution

  • A rather quick and wonky way of doing this (it matches plain substrings and only replaces the first occurrence of each term):

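    from typing import Dict, List, Tuple


    def replace_terms(text: str, terms: Dict[str, str]) -> str:
        # (position of first occurrence, term) pairs selected for replacement
        replacement_list: List[Tuple[int, str]] = []
        for term in terms:
            if term not in text:
                continue
            position = text.index(term)
            keep = True
            # Iterate over a copy, since entries may be removed inside the loop
            for replacement in list(replacement_list):
                if replacement[0] == position:
                    if len(term) > len(replacement[1]):
                        # The new term is longer: it supersedes the shorter one
                        replacement_list.remove(replacement)
                    else:
                        # A term at least as long already starts here: skip this one
                        keep = False
            if keep:
                replacement_list.append((position, term))
        # Only the first occurrence of each selected term is replaced
        for _, term in replacement_list:
            text = text.replace(term, terms[term], 1)
        return text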
    

    Usage:

    terms_dict = {
        "digi conso": "digi conso",
        "digi": "digi conso",
        "digiconso": "digi conso",
        "3xcb": "3xcb",
        "3x cb": "3xcb",
        "legal entity identifier": "legal entity identifier"
    }
    
    test_text = "i want a digi conso loan for digiconso"
    
    print(replace_terms(test_text, terms_dict))
    

    Result:

    i want a digi conso loan for digi conso
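If every occurrence needs replacing, or whole-word matching matters, a common alternative is to build one regular expression from all the keys and substitute in a single pass. The following is a sketch of that technique, not the approach used in the answer above; the name replace_terms_regex is invented here, and it assumes keys should only match when they are not glued to other word characters.

```python
import re
from typing import Dict


def replace_terms_regex(text: str, terms: Dict[str, str]) -> str:
    """Replace every whole-word occurrence of a key in one pass over the text."""
    if not terms:
        return text
    # Longest keys first, so "digi conso" is tried before "digi" at a position.
    keys = sorted(terms, key=len, reverse=True)
    # (?<!\w) and (?!\w) require that the match is not preceded or followed
    # by a word character, so "digi" cannot match inside "digiconso".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    # Replaced text is never rescanned, so nothing gets expanded twice.
    return pattern.sub(lambda m: terms[m.group(0)], text)


terms_dict = {
    "digi conso": "digi conso",
    "digi": "digi conso",
    "digiconso": "digi conso",
    "3xcb": "3xcb",
    "3x cb": "3xcb",
    "legal entity identifier": "legal entity identifier",
}

print(replace_terms_regex("i want a digi conso loan for digiconso", terms_dict))
# i want a digi conso loan for digi conso
```

Because the substitution happens in one pass, the order of the dictionary no longer matters, and a text such as "digi and 3x cb" becomes "digi conso and 3xcb".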