python tokenize text-processing

Split string on n or more whitespaces


I have a string like this:

sentence = 'This is   a  nice    day'

I want to have the following output:

output = ['This is', 'a  nice', 'day']

Here the string is split on runs of n = 3 or more whitespace characters, which is why the two spaces inside 'a  nice' are left intact.

How can I efficiently do this for any n?


Solution

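  • You can use Python's re.split with a quantified whitespace pattern:

    import re

    sentence = 'This is   a  nice day'
    # \s{3,} matches a run of three or more whitespace characters
    output = re.split(r'\s{3,}', sentence)
    print(output)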
    
    ['This is', 'a  nice day']
    

    To handle an arbitrary n, build the pattern from the variable:

    n = 3
    pattern = r'\s{' + str(n) + ',}'
    output = re.split(pattern, sentence)
    print(output)
    
    ['This is', 'a  nice day']
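
The approach above could be wrapped in a small reusable function. This is only a minimal sketch and not part of the original answer: the name `split_on_whitespace_runs` and the check on `n` are assumptions added for illustration. Note that `\s` also matches tabs and newlines; if only space characters should count, a literal space could replace `\s` in the pattern.

```python
import re


def split_on_whitespace_runs(text, n):
    """Split text on runs of n or more whitespace characters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Doubled braces are literal in an f-string, so n = 3 gives \s{3,}
    pattern = rf"\s{{{n},}}"
    return re.split(pattern, text)


sentence = 'This is   a  nice    day'
print(split_on_whitespace_runs(sentence, 3))  # ['This is', 'a  nice', 'day']
print(split_on_whitespace_runs(sentence, 2))  # ['This is', 'a', 'nice', 'day']
```

If the input can start or end with a long run of whitespace, `re.split` returns empty strings at the ends of the list; calling `text.strip()` first is one way to avoid that.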