Replace string with key from values list of dict with regexp or not


I need to replace the POS (part-of-speech) tags in a string with tags from another tag set.

# From
"When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$ ..."
# To
"When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON ..."

Here is an extract of the universal dependency POS file that maps the POS tags I have to the POS tags I want.

IN   ADP
NN   NOUN
PRP  PRON
PRP$ PRON
VBZ  VERB
WRB  ADV

I can load the mapping into a Python dict (showing only the keys relevant to the example):

{'VERB': ['MD', 'VB', 'VBZ', 'VBP', 'VBD', 'VBN', 'VBG'],
 'NOUN': ['NN', 'NNS', 'NNP', 'NNPS'], 'ADP': ['IN'],
 'PRON': ['PRP', 'PRP$', 'WP', 'WP$'],
 'ADV': ['RB', 'RBR', 'RBS', 'WRB'],
}

The dictionary is well-formed. I would like to replace each value in a list with its key, so MD would become VERB and NNS would become NOUN.

For the moment, I have this code:

import re

# Load the POStag mapping :
with open('POSfile', 'r', encoding='utf-8') as universal:
    dict_pos = {}
    for line in universal.readlines():
        result = re.match('(.+)\s+(.+)', line.strip())
        if result.group(2) not in dict_pos:
            dict_pos[result.group(2)] = []
        dict_pos[result.group(2)].append(result.group(1))

sentence = "When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$"
# For each target POStag
for key, value in dict_pos.items():
    # Replace any of the source POStag by the target POStag
    pattern = re.compile('|'.join(value))
    if re.match(pattern, sentence):
        line = re.sub(pattern, key, sentence)

I'm struggling with list values like {'DET': ['DT', 'PDT', 'WDT']}.

How can I replace any one of the list elements with the key (DET here)?


Solution

  • I suggest building the regex for each tag separately: add \b / (?<![^\W_]) or \B when the tag contains word chars, and omit boundaries altogether when the tag to replace contains no word chars at all.

    Note that I also suggest a non-regex way to initialize the dictionary.

    See the working Python demo:

    import re
    text = "When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$ biannual_JJ powwow_NN ,_, the_DT nation_NN 's_POS manufacturing_NN titans_NNS typically_RB jet_VBP off_RP to_TO the_DT sunny_JJ confines_NNS of_IN resort_NN towns_NNS like_IN Boca_NNP Raton_NNP and_CC Hot_NNP Springs_NNP ._."
    pos_file = "CC  CONJ  \nCD  NUM  \nDT  DET  \nEX  DT  \nFW  X  \nIN  ADP    \nJJ  ADJ    \nJJR ADJ   \nJJS ADJ    \nLS  X    \nMD  VERB    \nNN  NOUN  \nNNS NOUN  \nNNP NOUN  \nNNPS    NOUN  \nPDT DET \nPOS PRT  \nPRP PRON  \nPRP$    PRON  \nRB  ADV  \nRBR ADV  \nRBS ADV  \nRP  PRT  \nSYM X  \nTO  PRT  \nUH  X  \nVB  VERB  \nVBZ VERB  \nVBP VERB  \nVBD VERB  \nVBN VERB  \nVBG VERB  \nWDT DET  \nWP  PRON  \nWP$ PRON  \nWRB ADV  \n.   .  \n,   . \n:   .  \n(   .  \n)   .  "
    
    # Initializing the dictionary
    dict_pos = {}
    for line in pos_file.splitlines():  # With a real file: with open('POSfile', 'r', encoding='utf-8') as pos_file: for line in pos_file:
        c = line.strip().split()
        dict_pos[c[1]] = dict_pos.get(c[1], list()) + [c[0]]
    
    def to_regex(x):
        r = []
        if x[0].isalnum() or x[0] == '_':
            r.append(r'(?<![^\W_])')
        else:
            if any(l.isalnum() or l=='_' for l in x):
                r.append(r'\B')
        r.append(re.escape(x))
        if x[-1].isalnum() or x[-1] == '_':
            r.append(r'\b')
        else:
            if any(l.isalnum() or l=='_' for l in x):
                r.append(r'\B')
        return "".join(r)
    
    rx_dctvals = {}
    for key, val in dict_pos.items():
        rx_dctvals[re.compile("|".join(sorted([to_regex(v) for v in val], key=len, reverse=True)))] = key
    
    for rx, repl in rx_dctvals.items():
        text = rx.sub(repl.replace('\\', '\\\\'), text)
    
    print(text)
    

    Output:

    When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON biannual_ADJ powwow_NOUN ._. the_DET nation_NOUN 's_PRT manufacturing_NOUN titans_NOUN typically_ADV jet_VERB off_PRT to_PRT the_DET sunny_ADJ confines_NOUN of_ADP resort_NOUN towns_NOUN like_ADP Boca_NOUN Raton_NOUN and_CONJ Hot_NOUN Springs_NOUN ._.
    

    The to_regex(x) function escapes all special chars in each tag and wraps it in boundaries that depend on its first and last characters. A tag that starts with a word char gets a leading (?<![^\W_]), a word boundary that still allows _ right before the tag; a tag that ends with a word char gets a trailing \b. On a side where the tag starts or ends with a non-word char but still contains word chars (the $ in PRP$), \B is added instead. Tags with no word chars at all, such as , or ., get no boundaries.
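To see these boundary rules in action, here is a small sketch that repeats the helper (lightly condensed so the snippet runs on its own) and tries it on a few made-up sample strings. The names prp, naive and combined are only for illustration. It also shows why the alternatives are sorted longest first: PRP\b on its own already matches inside PRP$.

```python
import re

def to_regex(x):
    # Same logic as the helper in the demo, repeated so this runs standalone
    has_word = any(ch.isalnum() or ch == '_' for ch in x)
    r = []
    if x[0].isalnum() or x[0] == '_':
        r.append(r'(?<![^\W_])')   # no letter/digit right before the tag
    elif has_word:
        r.append(r'\B')
    r.append(re.escape(x))
    if x[-1].isalnum() or x[-1] == '_':
        r.append(r'\b')            # no word char right after the tag
    elif has_word:
        r.append(r'\B')
    return "".join(r)

prp = re.compile(to_regex('PRP'))
print(bool(prp.search("it_PRP")))   # True: "_" is allowed before the tag
print(bool(prp.search("xPRP")))     # False: a letter precedes the tag
print(bool(prp.search("PRPS")))     # False: \b fails between "P" and "S"

# Unsorted alternation: the shorter PRP wins and leaves a stray "$"
naive = re.compile("|".join(to_regex(v) for v in ['PRP', 'PRP$']))
print(naive.sub('PRON', "their_PRP$"))   # their_PRON$

# Longest pattern first, as in the demo
tags = ['PRP', 'PRP$', 'WP', 'WP$']
combined = re.compile("|".join(sorted((to_regex(v) for v in tags), key=len, reverse=True)))
print(combined.sub('PRON', "it_PRP their_PRP$ whose_WP$"))   # it_PRON their_PRON whose_PRON

# A tag without word chars gets no boundaries, so every comma is replaced
print(re.compile(to_regex(',')).sub('.', ",_,"))   # ._.
```

The last line also explains why ,_, shows up as ._. in the output above: the word before the underscore is rewritten too, not just the tag.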

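    The final loop, for rx, repl in rx_dctvals.items(): text = rx.sub(repl.replace('\\', '\\\\'), text), runs all the regex replacements sequentially. Doubling the backslashes keeps any backslash in a replacement string literal, because re.sub processes backslash escapes in the replacement.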
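Since the question also allows a solution without regexes, here is a minimal alternative sketch. It is not the approach used in the demo above. It assumes every token is separated by whitespace and carries its tag after the last underscore. The names tag_to_universal and convert are made up for this example, and the dictionary is the short extract from the question.

```python
# Short extract of the mapping, as shown in the question
dict_pos = {
    'VERB': ['MD', 'VB', 'VBZ', 'VBP', 'VBD', 'VBN', 'VBG'],
    'NOUN': ['NN', 'NNS', 'NNP', 'NNPS'],
    'ADP': ['IN'],
    'PRON': ['PRP', 'PRP$', 'WP', 'WP$'],
    'ADV': ['RB', 'RBR', 'RBS', 'WRB'],
}

# Invert it once: source tag -> target tag (hypothetical helper name)
tag_to_universal = {tag: key for key, tags in dict_pos.items() for tag in tags}

def convert(sentence):
    """Replace the tag after the last '_' of each whitespace-separated token."""
    out = []
    for token in sentence.split():
        word, sep, tag = token.rpartition('_')
        if not sep:
            # No underscore in this token: leave it untouched
            out.append(token)
            continue
        # Unknown tags are kept as they are
        out.append(word + sep + tag_to_universal.get(tag, tag))
    return ' '.join(out)

sentence = "When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$"
print(convert(sentence))
# When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON
```

Two differences from the regex demo are worth keeping in mind: this version only touches the tag, so a token like ,_, would keep its comma before the underscore, and it collapses runs of whitespace into single spaces. Each tag is looked up exactly once, so one replacement can never be rewritten by a later one.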