I am trying to reproduce, using regex, the classic tokenization trick for dealing with sentences like
"I didn't like that SO question, but I like pizza!"
The solution proposed in the literature is actually very simple: prepend NOT_ to
every token between "didn't" and the next punctuation mark. So our example becomes:
"I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!"
How can I do this in Python with regex?
Thanks!
Split on the negation cue and on punctuation with `re.split`, then prefix every word in the chunk that follows "didn't", and join the pieces back together:
import re

sentence = "I didn't like that SO question, but I like pizza!"

# Split on punctuation or on "didn't"; the capturing group makes
# re.split keep the delimiters in the result list.
words = re.split(r"([,.?:!;]|didn't)", sentence)

# Prefix each word with NOT_ only in the chunk that directly follows "didn't".
not_sentence = "".join([word if (idx == 0 or words[idx - 1] != "didn't")
                        else re.sub(r"(\w+)", r"NOT_\1", word)
                        for idx, word in enumerate(words)])

print(not_sentence)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
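The same idea generalizes to other negation cues. Here is a minimal sketch, assuming you also want to treat cues like "don't", "not", and "never" the same way; the cue list and the `mark_negation` helper name are my own choices, not part of the question:

```python
import re

# Hypothetical list of negation cues; extend as needed.
# \b word boundaries keep "not" from matching inside words like "nothing".
NEGATION_RE = r"\b(?:didn't|don't|not|never)\b"

def mark_negation(sentence):
    # Split on punctuation or a negation cue, keeping the delimiters.
    parts = re.split(r"([,.?:!;]|" + NEGATION_RE + r")", sentence)
    out = []
    for idx, part in enumerate(parts):
        if idx > 0 and re.fullmatch(NEGATION_RE, parts[idx - 1]):
            # The previous chunk was a negation cue: prefix every word.
            part = re.sub(r"(\w+)", r"NOT_\1", part)
        out.append(part)
    return "".join(out)

print(mark_negation("I didn't like that SO question, but I like pizza!"))
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
print(mark_negation("I never liked it."))
# I never NOT_liked NOT_it.
```

Note that NLTK ships a similar utility (`nltk.sentiment.util.mark_negation`) that works on token lists rather than raw strings, which may be preferable in a larger pipeline.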