I'm trying to tokenize an input string in Python, but every attempt so far has raised Python's "nothing to repeat" error.
Currently, I'm using re.findall instead of re.split, but I'm not sure where my mistake is with my regex.
My current regex looks like this:
inputList = re.findall(r"[\w']+|[.,!?;]|[\s]", testString)
I want to split on punctuation or whitespace.
I previously tried:
inputList = re.split(r'(\s|\W)*', testString)
But this would give me undesirable output strings.
I also tried:
inputList = re.split(r'(\s+)|([.,!?;]+)', testString)
But I was getting the same error.
An example of testString:
testString = "Beautiful King John! ??? I'm here. It's 'bout time."
An example of the desired output:
['Beautiful', ' ', 'King', ' ', 'John', '!', ' ', '?', '?', '?', ' ', "I'm", ' ', 'here', '.', ' ', "It's", ' ', "'bout", ' ', 'time', '.']
I'm getting the right output with my re.findall, but Python is throwing the error and I'd like to be rid of it, if possible. Could someone point out the error I'm making with my regex?
For your example this works, but it also produces empty strings:
re.split(r'([ !?.])', testString)
# ['Beautiful', ' ', 'King', ' ', 'John', '!', '', ' ', '', '?', '', '?', '', '?', '', ' ', "I'm", ' ', 'here', '.', '', ' ', "It's", ' ', "'bout", ' ', 'time', '.', '']
But your desired output is then just a filter operation away:
inputList = [t for t in re.split(r'([ !?.])', testString) if t]
# ['Beautiful', ' ', 'King', ' ', 'John', '!', ' ', '?', '?', '?', ' ', "I'm", ' ', 'here', '.', ' ', "It's", ' ', "'bout", ' ', 'time', '.']
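The split-and-filter idea above can be wrapped into a small helper. This is only a sketch: the names `DELIMITERS` and `tokenize` are made up for illustration, and the character class is widened (an assumption on my part) to cover any whitespace plus the `,` and `;` from the question's own pattern.

```python
import re

# Hypothetical names, not from the original posts.
# A delimiter is any single whitespace character or one of . , ! ? ;
# The capturing group makes re.split keep the delimiters in its result.
DELIMITERS = re.compile(r"([\s.,!?;])")

def tokenize(text):
    """Split text on whitespace/punctuation, keeping each delimiter as a token."""
    # re.split yields empty strings between adjacent delimiters
    # (for example between "!" and " "), so those are filtered out.
    return [token for token in DELIMITERS.split(text) if token]

testString = "Beautiful King John! ??? I'm here. It's 'bout time."
inputList = tokenize(testString)
print(inputList)
```

Because each delimiter is matched as a single character with no `*` or `+` quantifier, the pattern can never match an empty string, and repeated punctuation such as `???` comes back as separate tokens.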