I need to replace the POStags (parts of speech) by other POStags.
# From
"When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$ ..."
# To
"When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON ..."
Here is an extract of the universal dependency POS file that does the mapping between the POStags I have and the POStags I want.
IN ADP
NN NOUN
PRP PRON
PRP$ PRON
VBZ VERB
WRB ADV
I can load the mapping into a Python dict (showing only the keys relevant to this example):
{'VERB': ['MD', 'VB', 'VBZ', 'VBP', 'VBD', 'VBN', 'VBG'],
'NOUN': ['NN', 'NNS', 'NNP', 'NNPS'], 'ADP': ['IN'],
'PRON': ['PRP', 'PRP$', 'WP', 'WP$'],
'ADV': ['RB', 'RBR', 'RBS', 'WRB'],
}
The dictionary is well-formed. I would like to replace each value in a list with its key, so MD would become VERB and NNS would become NOUN.
For the moment, I have this code:
import re
# Load the POStag mapping :
with open('POSfile', 'r', encoding='utf-8') as universal:
    dict_pos = {}
    for line in universal.readlines():
        result = re.match('(.+)\s+(.+)', line.strip())
        if result.group(2) not in dict_pos:
            dict_pos[result.group(2)] = []
        dict_pos[result.group(2)].append(result.group(1))
sentence = "When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$"
# For each target POStag
for key, value in dict_pos.items():
    # Replace any of the source POStag by the target POStag
    pattern = re.compile('|'.join(value))
    if re.match(pattern, sentence):
        line = re.sub(pattern, key, sentence)
I'm struggling with list values like {'DET': ['DT', 'PDT', 'WDT']}. How can I replace any one of the elements in the value list with the key (DET) here?
I suggest building each regex pattern separately: add \b / (?<![^\W_]) or \B where the string to replace contains word chars, and omit the boundaries altogether when it contains no word chars at all.
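To see why a plain \b is not enough for these tags, here is a minimal sketch (the sample string `s` and the variable names are only illustrative). `_` counts as a word character, so there is no word boundary between `_` and the tag, and `$` is a non-word character, so `PRP\b` also matches the `PRP` part of `PRP$`.

```python
import re

s = "it_PRP their_PRP$ time_NN"

# "_" is a word character, so \b never matches between "_" and "PRP".
plain = re.findall(r'\bPRP\b', s)
print(plain)       # []

# (?<![^\W_]) means "not preceded by a letter or digit": it accepts "_",
# a non-word character, or the start of the string. Note that PRP\b also
# matches inside "PRP$", because "$" is a non-word character.
lookbehind = re.findall(r'(?<![^\W_])PRP\b', s)
print(lookbehind)  # ['PRP', 'PRP']

# For a tag ending in "$", \B after the "$" requires that no word character follows.
dollar = re.findall(r'(?<![^\W_])PRP\$\B', s)
print(dollar)      # ['PRP$']
```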
Note I also suggest a non-regex way to initialize the dictionary.
See the working Python demo:
import re
text = "When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$ biannual_JJ powwow_NN ,_, the_DT nation_NN 's_POS manufacturing_NN titans_NNS typically_RB jet_VBP off_RP to_TO the_DT sunny_JJ confines_NNS of_IN resort_NN towns_NNS like_IN Boca_NNP Raton_NNP and_CC Hot_NNP Springs_NNP ._."
pos_file = "CC CONJ \nCD NUM \nDT DET \nEX DT \nFW X \nIN ADP \nJJ ADJ \nJJR ADJ \nJJS ADJ \nLS X \nMD VERB \nNN NOUN \nNNS NOUN \nNNP NOUN \nNNPS NOUN \nPDT DET \nPOS PRT \nPRP PRON \nPRP$ PRON \nRB ADV \nRBR ADV \nRBS ADV \nRP PRT \nSYM X \nTO PRT \nUH X \nVB VERB \nVBZ VERB \nVBP VERB \nVBD VERB \nVBN VERB \nVBG VERB \nWDT DET \nWP PRON \nWP$ PRON \nWRB ADV \n. . \n, . \n: . \n( . \n) . "
# Initializing the dictionary
dict_pos = {}
for line in pos_file.splitlines(): # With a real file: with open('POSfile', 'r', encoding='utf-8') as pos_file: for line in pos_file:
    c = line.strip().split()
    dict_pos[c[1]] = dict_pos.get(c[1], list()) + [c[0]]
def to_regex(x):
    r = []
    if x[0].isalnum() or x[0] == '_':
        r.append(r'(?<![^\W_])')
    else:
        if any(l.isalnum() or l=='_' for l in x):
            r.append(r'\B')
    r.append(re.escape(x))
    if x[-1].isalnum() or x[-1] == '_':
        r.append(r'\b')
    else:
        if any(l.isalnum() or l=='_' for l in x):
            r.append(r'\B')
    return "".join(r)
rx_dctvals = {}
for key, val in dict_pos.items():
    rx_dctvals[re.compile("|".join(sorted([to_regex(v) for v in val], key=len, reverse=True)))] = key
for rx, repl in rx_dctvals.items():
    text = rx.sub(repl.replace('\\', '\\\\'), text)
print(text)
Output:
When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON biannual_ADJ powwow_NOUN ._. the_DET nation_NOUN 's_PRT manufacturing_NOUN titans_NOUN typically_ADV jet_VERB off_PRT to_PRT the_DET sunny_ADJ confines_NOUN of_ADP resort_NOUN towns_NOUN like_ADP Boca_NOUN Raton_NOUN and_CONJ Hot_NOUN Springs_NOUN ._.
The to_regex(x) function takes each string from each value and builds a pattern for it. It prepends a leading word boundary that excludes _ ((?<![^\W_])) to each item that starts with a word char, appends \b to each item that ends with a word char, adds \B on a side that does not start or end with a word char when the item still contains word chars, and adds no boundaries if there are no word chars in the item at all. It also escapes all special chars.
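As an illustration, the patterns produced for two of the tags can be printed directly. The sketch below uses a compact but equivalent rewrite of to_regex; the `tags`, `sample`, `sorted_rx` and `unsorted_rx` names are only illustrative. It also suggests why the alternatives are sorted longest first: otherwise PRP would be tried before PRP$.

```python
import re

def to_regex(x):
    # Equivalent to to_regex in the demo, written compactly.
    has_word = any(c.isalnum() or c == '_' for c in x)
    r = []
    if x[0].isalnum() or x[0] == '_':
        r.append(r'(?<![^\W_])')
    elif has_word:
        r.append(r'\B')
    r.append(re.escape(x))
    if x[-1].isalnum() or x[-1] == '_':
        r.append(r'\b')
    elif has_word:
        r.append(r'\B')
    return "".join(r)

print(to_regex('NN'))    # (?<![^\W_])NN\b
print(to_regex('PRP$'))  # (?<![^\W_])PRP\$\B

tags = ['PRP', 'PRP$', 'WP', 'WP$']
# Longest pattern first, so PRP$ is tried before PRP.
sorted_rx = re.compile("|".join(sorted(map(to_regex, tags), key=len, reverse=True)))
# Same alternatives in list order, for comparison.
unsorted_rx = re.compile("|".join(map(to_regex, tags)))

sample = "their_PRP$ it_PRP"
print(sorted_rx.sub('PRON', sample))    # their_PRON it_PRON
print(unsorted_rx.sub('PRON', sample))  # their_PRON$ it_PRON
```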
The for rx, repl in rx_dctvals.items(): text = rx.sub(repl.replace('\\', '\\\\'), text) part runs all regex replacements sequentially.
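If every token has the form word_TAG, a single pass with a flat lookup may also work. This is a different technique from the sequential replacements above, sketched under that assumption; `tag_map` and `retag` are hypothetical names, and the map holds only the tags from the short example sentence.

```python
import re

# Flat source-tag -> target-tag lookup (only the tags used in this example).
tag_map = {'WRB': 'ADV', 'PRP': 'PRON', 'PRP$': 'PRON',
           'VBZ': 'VERB', 'NN': 'NOUN', 'IN': 'ADP'}

def retag(sentence):
    # The tag is the run of characters after the last "_" of a token,
    # up to the next whitespace or the end of the string.
    # Tags missing from tag_map are left unchanged.
    return re.sub(r'(?<=_)[^\s_]+(?!\S)',
                  lambda m: tag_map.get(m.group(), m.group()),
                  sentence)

result = retag("When_WRB it_PRP 's_VBZ time_NN for_IN their_PRP$")
print(result)
# When_ADV it_PRON 's_VERB time_NOUN for_ADP their_PRON
```

Unlike the sequential approach, this only touches the text after the last underscore of each token, so the word part of a token is never rewritten.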