python regex nlp tokenize substitution

How to reverse the regex in contractions tokenization?


In NLP tokenization, contractions are sometimes split up like this:

>>> import re
>>> s = 'he cannot fly'
>>> pattern, substitution  = r"(?i)\b(can)(not)\b", r" \1 \2 "
>>> re.sub(pattern, substitution, s)
'he  can not  fly'

To reverse it (i.e. detokenization), I've tried this:

>>> rev_pattern, rev_substitution  = r"(?i)\b(can)\s(not)\b", r" \1\2 "
>>> re.sub(rev_pattern, rev_substitution, s)
'he cannot fly'

The question is: are r"(?i)\b(can)\s(not)\b" and r" \1\2 " the reverse of the original pattern and substitution? Is there another way to reverse this?
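One way to check whether the second pattern really undoes the first is to run it on the tokenized string rather than on the original s. The sketch below does that. The helper names tokenize and detokenize are made up for illustration, and the whitespace clean-up step is an assumption about what should count as "reversed":

```python
import re

pattern, substitution = r"(?i)\b(can)(not)\b", r" \1 \2 "
rev_pattern, rev_substitution = r"(?i)\b(can)\s(not)\b", r" \1\2 "

def tokenize(text):
    # Hypothetical helper: split the contraction, padding it with spaces.
    return re.sub(pattern, substitution, text)

def detokenize(text):
    # Hypothetical helper: rejoin the two parts, then collapse the padding
    # that both substitutions leave behind (an assumption, see note below).
    joined = re.sub(rev_pattern, rev_substitution, text)
    return re.sub(r"\s+", " ", joined).strip()

s = 'he cannot fly'
tokenized = tokenize(s)                                         # 'he  can not  fly'
raw_reverse = re.sub(rev_pattern, rev_substitution, tokenized)  # extra spaces remain
round_trip = detokenize(tokenized)                              # 'he cannot fly'
```

Here the raw reverse substitution rejoins the word but leaves extra spaces around it, so the two substitutions are inverses only up to whitespace. Collapsing whitespace would also squash any runs of spaces that were in the input to begin with, which may or may not be acceptable.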

In this case, I've manually coded the \s into the pattern. The main problem is that there are a bunch of these regexes hand-coded for tokenization, and I have to add the \s to all of them manually:

CONTRACTIONS2 = [re.compile(r"(?i)\b(can)(not)\b"),
                 re.compile(r"(?i)\b(d)('ye)\b"),
                 re.compile(r"(?i)\b(gim)(me)\b"),
                 re.compile(r"(?i)\b(gon)(na)\b"),
                 re.compile(r"(?i)\b(got)(ta)\b"),
                 re.compile(r"(?i)\b(lem)(me)\b"),
                 re.compile(r"(?i)\b(mor)('n)\b"),
                 re.compile(r"(?i)\b(wan)(na) ")]
CONTRACTIONS3 = [re.compile(r"(?i) ('t)(is)\b"),
                 re.compile(r"(?i) ('t)(was)\b")]
CONTRACTIONS4 = [re.compile(r"(?i)\b(whad)(dd)(ya)\b"),
                 re.compile(r"(?i)\b(wha)(t)(cha)\b")]

Is there a way to automatically iterate through the list of regexes and add the \s between the groups, without hardcoding the detokenization regexes?

I know that the original tokenization substitution is r' \1 \2 ', so to undo it I have to change it to r' \1\2 '.


Solution

  • You can just put a regex comment, (?#...), between the groups as a marker, then call str.replace on the pattern string to swap the marker for \s.

    e.g.:

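    import re

    # (?#A) is a regex comment: it matches nothing and only marks the split point
    PATTERNS = [r"(?i)\b(can)(?#A)(not)\b",
                r"(?i)\b(d)(?#A)('ye)\b",
                r"(?i)\b(gim)(?#A)(me)\b",
                r"(?i)\b(gon)(?#A)(na)\b"]
    CONTRACTIONS = [re.compile(x) for x in PATTERNS]
    # Swap the marker for \s to get the detokenization patterns
    REVERSORS    = [re.compile(x.replace('(?#A)', r'\s')) for x in PATTERNS]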
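To see the marker idea working end to end, here is a minimal sketch of a round trip. The tokenize and detokenize functions are hypothetical helpers, not part of any library. The r"\1\2" replacement and the final whitespace collapse are assumptions chosen so that the padding added during tokenization disappears again:

```python
import re

# "(?#A)" is a regex comment, so the tokenizing patterns behave as if it were absent.
MARKER = "(?#A)"
PATTERNS = [r"(?i)\b(can)(?#A)(not)\b",
            r"(?i)\b(d)(?#A)('ye)\b",
            r"(?i)\b(gim)(?#A)(me)\b",
            r"(?i)\b(gon)(?#A)(na)\b"]

CONTRACTIONS = [re.compile(p) for p in PATTERNS]
REVERSORS = [re.compile(p.replace(MARKER, r"\s")) for p in PATTERNS]

def tokenize(text):
    # Split each contraction into two space-padded parts.
    for regex in CONTRACTIONS:
        text = regex.sub(r" \1 \2 ", text)
    return text

def detokenize(text):
    # Rejoin the two parts of each contraction.
    for regex in REVERSORS:
        text = regex.sub(r"\1\2", text)
    # Assumption: collapsing runs of whitespace is acceptable for this text.
    return re.sub(r"\s+", " ", text).strip()

original = "I cannot go, gimme that"
tokens = tokenize(original)
restored = detokenize(tokens)
```

Two caveats apply. The reversing patterns cannot tell a split contraction from text that was already written as two words, so an input containing "can not" would come back as "cannot". The three-group patterns from the question would need a second marker and r"\1\2\3" as the replacement. If the patterns are already compiled, their source string is available as regex.pattern, so the same str.replace call could target ")(" instead of a marker, assuming ")(" only ever occurs between adjacent groups.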