regex nlp tokenize

Keeping special marks when splitting text into tokens using regex


I have the text 'I love this but I have a! question to?' and am currently using:

import re
text = 'I love this but I have a! question to?'
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)

With this regex I get:

['I', 'love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']

I didn't write this regex and I know very little about regular expressions (I tried to work it out from examples but gave up). I now need to change it so that it keeps question marks and exclamation marks and returns each one as a separate token, so that it produces this list:

['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

Any suggestions on how I can do that?


Solution

  • Try this:

    token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
    token_pattern.findall(text)
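
    In addition to whole words, it matches every character that is neither a word character (letter, digit, or underscore) nor a space, each as its own token. Note that this also picks up tabs and newlines; write [^\w\s] instead of [^\w ] if all whitespace should be skipped.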

    If you really only need question marks and exclamation marks, you can change the regex to:

    token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
    token_pattern.findall(text)
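
To compare the two patterns side by side, here is a minimal, self-contained sketch. The variable names (any_symbol, marks_only, sample) and the second sample string are only illustrative assumptions, not part of the original question.

```python
import re

text = "I love this but I have a! question to?"

# Pattern 1: any single character that is not a word character or a space,
# or a whole word.
any_symbol = re.compile(r"(?u)[^\w ]|\b\w+\b")

# Pattern 2: only '!' or '?', or a whole word.
marks_only = re.compile(r"(?u)[!?]|\b\w+\b")

# On the text from the question both patterns give the same tokens.
print(any_symbol.findall(text))
print(marks_only.findall(text))

# A made-up sample with other punctuation shows where they differ.
sample = "Wait, what?! No."
any_symbol_tokens = any_symbol.findall(sample)
marks_only_tokens = marks_only.findall(sample)
print(any_symbol_tokens)  # ['Wait', ',', 'what', '?', '!', 'No', '.']
print(marks_only_tokens)  # ['Wait', 'what', '?', '!', 'No']
```

The first pattern keeps commas and periods as tokens as well, while the second silently drops everything except words, '!' and '?'. Which one fits depends on whether other punctuation should survive tokenization.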