python regex nlp

How can I split at word boundaries with regexes?


I'm trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?


Solution

  • Before Python 3.7, re.split could not split on empty (zero-width) matches such as \b, which is why the string comes back unsplit.

    To get around this, use findall instead of split.

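    \b is a zero-width assertion: it matches the position at a word boundary and consumes no characters.

    Between two characters it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w). It also matches at the start or end of the string when the adjacent character is a word character.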

    That means the following code splits the sentence at every word boundary. Note that runs of whitespace come back as tokens too:

    import re
    sentence = "How are you?"
    print(re.findall(r'\w+|\W+', sentence))
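
The question asks for the tokens without the whitespace between them. Here is a minimal sketch of two ways to get that, assuming whitespace should be dropped and punctuation kept. The names `tokens`, `pieces` and `cleaned` are illustrative only, and the second approach assumes Python 3.7 or later, where re.split accepts patterns that match the empty string.

```python
import re

sentence = "How are you?"

# Option 1: match runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
tokens = re.findall(r'\w+|[^\w\s]+', sentence)
print(tokens)  # ['How', 'are', 'you', '?']

# Option 2 (assumes Python 3.7+): split on the zero-width \b directly,
# then discard the empty and whitespace-only pieces.
pieces = re.split(r'\b', sentence)
cleaned = [p for p in pieces if p.strip()]
print(cleaned)  # ['How', 'are', 'you', '?']
```

Option 1 works on any Python version, so it is the safer choice if the code has to run on older interpreters.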