I am splitting the text para and preserving the line breaks \n using the following:
from nltk import SpaceTokenizer
para="\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)
This gives me the following:
print(sent)
['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
My goal is to get the following output
['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
That is to say, I would like to split 'comma,' into 'comma', ',', split 'period.' into 'period', '.', and split 'question?' into 'question', '?', while preserving the \n.
I have tried word_tokenize, and it does split 'comma,' into 'comma', ',' etc., but it does not preserve \n.
What can I do to further split sent as shown above while preserving \n?
Per @randy's suggestion to look at https://docs.python.org/3/library/re.html#re.split, I tried:
import re
para = re.split(r'(\W+)', '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*')
print(para)
Output (close to what I am looking for)
['', '\n[', 'STUFF', ']\n ', 'comma', ', ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
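One possible way to get closer to the goal output is sketched below. It is not a definitive answer. It assumes that only a single trailing ',', '.' or '?' after a run of word characters needs to be split off, as in the examples above, and that everything else (including tokens containing \n) should be left untouched. The helper name split_keep_newlines is made up for illustration. It splits on single spaces with str.split(' '), which is what SpaceTokenizer does, and then uses re.fullmatch to separate the trailing punctuation.

```python
import re

# Assumption: only a single trailing ',', '.' or '?' after word characters
# is split off, matching the examples in the question.
TRAILING_PUNCT = re.compile(r'(\w+)([,.?])')

def split_keep_newlines(text):
    """Split on single spaces, then separate trailing punctuation."""
    tokens = []
    for tok in text.split(' '):  # same splitting rule as SpaceTokenizer
        m = TRAILING_PUNCT.fullmatch(tok)
        if m:
            # e.g. 'comma,' becomes 'comma' and ','
            tokens.extend(m.groups())
        else:
            # tokens with '\n', brackets, '*', or empty strings stay as they are
            tokens.append(tok)
    return tokens

para = "\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
print(split_keep_newlines(para))
```

With para exactly as written above, this prints ['\n[STUFF]\n', 'comma', ',', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '\nline\n', 'new', 'char*']. Empty '' tokens only show up where the input contains consecutive spaces.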