python string tokenize

Regex to prepend NOT_ to every word between "didn't" and the next punctuation mark


I am trying to reproduce, using regex, the classic tokenization trick for handling negation in sentences like

"I didn't like that SO question, but I like pizza!"

The solution proposed in the literature is actually very simple: prepend NOT_ to every token between "didn't" and the next punctuation mark. In our example this becomes:

"I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!"

How can we do that using Python and regex?

Thanks!


Solution

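  • Split the sentence on punctuation and "didn't" with a capturing regex, prefix the words in the chunk that follows "didn't", then join the pieces back together: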

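    import re

    sentence = "I didn't like that SO question, but I like pizza!"
    # The capturing group keeps the delimiters (punctuation and "didn't") in the result.
    words = re.split(r"([,.?:!;]|didn't)", sentence)
    # Prefix words only in the chunk that immediately follows "didn't".
    not_sentence = "".join(
        re.sub(r"(\w+)", r"NOT_\1", word) if idx > 0 and words[idx - 1] == "didn't" else word
        for idx, word in enumerate(words)
    )
    print(not_sentence)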
    # I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
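
The snippet above only handles the literal word "didn't". If you need other negation cues as well, one possible variation is a single `re.sub` with a replacement function. This is only a sketch: the `NEGATIONS` pattern and the `mark_negation` name are assumptions of mine, not something from the question, and you would likely need to adjust the cue list for your data.

```python
import re

# Assumed negation cues (hypothetical list): "not", "no", "never", and any "...n't" contraction.
NEGATIONS = r"(?:not|no|never|\w+n't)"

# Group 1: the negation word. Group 2: everything up to (but excluding) the next punctuation mark.
SCOPE = re.compile(r"\b(" + NEGATIONS + r")\b([^,.?:!;]*)", re.IGNORECASE)


def mark_negation(sentence):
    """Prefix NOT_ to each word between a negation cue and the next punctuation mark."""
    def tag(match):
        # Words may contain apostrophes, so nested contractions are prefixed as a whole.
        scope = re.sub(r"[\w']+", lambda m: "NOT_" + m.group(0), match.group(2))
        return match.group(1) + scope
    return SCOPE.sub(tag, sentence)


print(mark_negation("I didn't like that SO question, but I like pizza!"))
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
```

Because the punctuation mark itself is never consumed by the match, the rest of the sentence is left untouched, and a second negation later in the sentence starts its own scope.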