In NLP tokenization, contractions are sometimes split up like this:
>>> import re
>>> s = 'he cannot fly'
>>> pattern, substitution = r"(?i)\b(can)(not)\b", r" \1 \2 "
>>> tokenized = ' '.join(re.sub(pattern, substitution, s).split())  # collapse the padding spaces
>>> tokenized
'he can not fly'
To reverse it (i.e. detokenization), I've tried this:
>>> rev_pattern, rev_substitution = r"(?i)\b(can)\s(not)\b", r" \1\2 "
>>> ' '.join(re.sub(rev_pattern, rev_substitution, tokenized).split())
'he cannot fly'
The question is: are r"(?i)\b(can)\s(not)\b" and r" \1\2 " the reverse of the original pattern and substitution? Is there another way to reverse this?
In this case, I've manually coded the \s into the pattern. The main problem is that there are a bunch of these regexes hand-coded for tokenization, and I'd have to manually add the \s to all of them:
CONTRACTIONS2 = [re.compile(r"(?i)\b(can)(not)\b"),
                 re.compile(r"(?i)\b(d)('ye)\b"),
                 re.compile(r"(?i)\b(gim)(me)\b"),
                 re.compile(r"(?i)\b(gon)(na)\b"),
                 re.compile(r"(?i)\b(got)(ta)\b"),
                 re.compile(r"(?i)\b(lem)(me)\b"),
                 re.compile(r"(?i)\b(mor)('n)\b"),
                 re.compile(r"(?i)\b(wan)(na) ")]
CONTRACTIONS3 = [re.compile(r"(?i) ('t)(is)\b"),
                 re.compile(r"(?i) ('t)(was)\b")]
CONTRACTIONS4 = [re.compile(r"(?i)\b(whad)(dd)(ya)\b"),
                 re.compile(r"(?i)\b(wha)(t)(cha)\b")]
Is there a way to automatically iterate through the list of regexes and add the \s between the groups, without hardcoding the detokenization regexes? I know that the original tokenization substitution is r" \1 \2 ", so to undo it I have to change the substitution back to r" \1\2 ".
You can just put a comment (?#...) between the groups, then build the reverse patterns with str.replace, e.g.:
PATTERNS = [r"(?i)\b(can)(?#A)(not)\b",
r"(?i)\b(d)(?#A)('ye)\b",
r"(?i)\b(gim)(?#A)(me)\b",
r"(?i)\b(gon)(?#A)(na)\b"]
CONTRACTIONS = [re.compile(x) for x in PATTERNS]
REVERSORS = [re.compile(x.replace('(?#A)', '\s')) for x in PATTERNS]
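For completeness, here's a minimal round-trip sketch built on those two lists; the tokenize/detokenize helpers are my own illustration, and the padded substitutions are assumed from the r" \1 \2 " and r" \1\2 " values quoted in the question:

TOKEN_SUB = r" \1 \2 "    # assumed: the question's tokenization substitution
REVERSE_SUB = r" \1\2 "   # assumed: the question's detokenization substitution

def tokenize(text):
    for pat in CONTRACTIONS:
        text = pat.sub(TOKEN_SUB, text)
    return ' '.join(text.split())  # collapse the padding spaces

def detokenize(text):
    for pat in REVERSORS:
        text = pat.sub(REVERSE_SUB, text)
    return ' '.join(text.split())

A quick check:
>>> detokenize(tokenize('he cannot fly'))
'he cannot fly'

The same trick covers the three-group patterns (e.g. CONTRACTIONS4): put a (?#A) at each junction, and the single replace call turns all of them into \s.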