I'm trying to split a string every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^&*()]', ' \1',toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.
You can tokenize strings like yours into runs of digits, runs of letters, and runs of other characters (anything that is not whitespace, a letter, or a digit) using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars that are either non-word, non-whitespace chars or _
[^\W\d_]+ - any 1+ Unicode letters.
See the regex demo.
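As a quick check, here is a minimal sketch that runs the pattern on the string from the question. The name TOKEN_RE is only illustrative and does not come from the original code.

```python
import re

toSplit = 'I2eat!Apples22becauseilike?Them'

# Three alternatives: a run of digits, a run of punctuation/underscores,
# or a run of letters. There are no capturing groups, so findall returns
# the full text of each match.
TOKEN_RE = re.compile(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+')

tokens = TOKEN_RE.findall(toSplit)
print(tokens)
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
```

Whitespace matches none of the three alternatives, so it is skipped rather than returned as a token.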
A matching approach is more flexible than splitting because it also lets you tokenize more complex structures. Say you also want to tokenize decimal (float, double...) numbers. You just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
             ^^^^^^^^^^^^^
See this regex demo.
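To see the difference, here is a small sketch comparing the two patterns. The sample string and the names INT_RE and DECIMAL_RE are made up for illustration.

```python
import re

# Hypothetical input containing a decimal number and a trailing period.
sample = 'pi3.14!x22.'

INT_RE = re.compile(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+')
DECIMAL_RE = re.compile(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+')

int_tokens = INT_RE.findall(sample)
decimal_tokens = DECIMAL_RE.findall(sample)

print(int_tokens)
# ['pi', '3', '.', '14', '!', 'x', '22', '.']
print(decimal_tokens)
# ['pi', '3.14', '!', 'x', '22', '.']
```

The optional group (?:\.\d+)? only consumes a dot when digits follow it, so the trailing period after 22 is still returned as its own punctuation token.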