I have this text 'I love this but I have a! question to?' and am currently using
import re
text = 'I love this but I have a! question to?'
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)
With this regex I get
['I', 'love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']
I didn't write this regex and I know almost nothing about regular expressions (I tried to understand them from examples but gave up). I now need to change it so that it keeps question marks and exclamation marks and returns each one as its own token, so that it returns this list:
['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']
Any suggestions on how I can do that?
Try this:
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
token_pattern.findall(text)
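In addition to words, it matches every character that is neither a word character (letter, digit or underscore) nor a space, each as its own single-character token. Note that this includes other punctuation and also whitespace other than the plain space, such as tabs and newlines.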
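To make the behaviour concrete, here is a small sketch that runs the pattern on the text from the question and on a second, made-up input (the `"Hello, world!\tok"` string and the `other_tokens` name are only illustrations, not part of the question):

```python
import re

# Same pattern as above: either one character that is neither a word
# character nor a space, or a whole run of word characters.
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")

text = 'I love this but I have a! question to?'
tokens = token_pattern.findall(text)
print(tokens)
# ['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

# Hypothetical second input: other punctuation and a tab become tokens too.
other_tokens = token_pattern.findall("Hello, world!\tok")
print(other_tokens)
# ['Hello', ',', 'world', '!', '\t', 'ok']
```

The two alternatives are tried in order at each position, so a punctuation character is picked up by `[^\w ]` and a word by `\b\w+\b`; plain spaces match neither and are skipped.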
If you really only need question marks and exclamation marks, you can change the regex to:
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
token_pattern.findall(text)
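As a quick check of the narrower pattern, the sketch below uses a made-up sentence (the `sample` string is only an assumption for illustration) to show that `!` and `?` are kept as separate tokens while other punctuation is dropped:

```python
import re

# Only '!' and '?' are kept as punctuation tokens; everything else
# that is not a word is ignored.
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")

sample = "Wait, what?! No."
narrow_tokens = token_pattern.findall(sample)
print(narrow_tokens)
# ['Wait', 'what', '?', '!', 'No']
```

Because `[!?]` matches exactly one character, consecutive marks such as `?!` come back as two tokens rather than one.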