python string split nltk tokenize

Splitting text further while preserving line breaks


I am splitting the text in para on spaces while preserving the line breaks (\n), using the following:

from nltk import SpaceTokenizer
para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"
sent = SpaceTokenizer().tokenize(para)

print(sent) then gives me the following:

['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

My goal is to get the following output:

['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

That is, I would like to split 'comma,' into 'comma', ',', split 'period.' into 'period', '.', and split 'question?' into 'question', '?', while preserving the \n characters.

I have tried word_tokenize, which does split 'comma,' into 'comma', ',' and so on, but it does not preserve \n.

What can I do to further split sent as shown above while preserving \n?


Solution

  • Per @randy's suggestion to look at re.split (https://docs.python.org/3/library/re.html#re.split):

    import re
    para = re.split(r'(\W+)', '\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*')
    print(para)
    

    Output (close to what I am looking for, although it also splits off the brackets, the whitespace runs, and the trailing *):

    ['', '\n[', 'STUFF', ']\n  ', 'comma', ',  ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n  \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
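
If the exact list from the question is needed, one possible approach is to split on single spaces first (which is what SpaceTokenizer does) and then peel a trailing punctuation mark off each token. The following is only a minimal sketch: the helper name split_trailing_punct and the punctuation set [,.?!;:] are assumptions made for illustration, not part of NLTK or of the original answer.

```python
import re

# Assumption: only these trailing marks should become their own tokens.
# '*' and brackets are deliberately left out so 'char*' and '[STUFF]' stay whole.
TRAILING_PUNCT = re.compile(r'(.*\w)([,.?!;:])', re.DOTALL)

def split_trailing_punct(text):
    """Split on single spaces, then separate one trailing punctuation mark."""
    tokens = []
    for tok in text.split(' '):           # same behaviour as SpaceTokenizer
        m = TRAILING_PUNCT.fullmatch(tok)
        if m:
            tokens.extend(m.groups())     # e.g. 'comma,' -> 'comma', ','
        else:
            tokens.append(tok)            # '\n' tokens and '' pass through unchanged
    return tokens

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"
sent = split_trailing_punct(para)
print(sent)
```

Because tokens that end in \n never match the pattern, the line breaks are kept exactly where the space split left them.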