Why is Python re not splitting multiple instances of punctuation?


I am trying to split input text at spaces and at all special characters such as punctuation, while keeping the delimiters. My re pattern works exactly the way I want, except that it does not split consecutive punctuation marks into separate elements. Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)', words)

If I have a word like "hello" with two punctuation marks after it, those punctuation marks are split off from the word but stay together in a single element. For example, "hello,-" becomes "hello", ",-" but I want it to be "hello", ",", "-"

Another example: "My name is mud!!!" is split into "My", "name", "is", "mud", "!!!" but I want it to be "My", "name", "is", "mud", "!", "!", "!"


Solution

  • The + quantifier makes each match cover a whole run of consecutive non-word characters, so the run comes back as one element. Remove the + so that the pattern matches a single non-word character at a time, something like:

    import re
    
    words = 'My name is mud!!!'
    splitted = re.split(r'([^-\w])', words)
    # ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
    

    This also produces empty strings between adjacent non-word characters (because you're splitting on each of them), but you can remove them by post-processing the result:

    splitted = [match for match in re.split(r'([^-\w])', words) if match]
    # ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
    

    You can also strip whitespace in the list comprehension's condition (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
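
Putting those steps together, a minimal sketch might look like the following. The function names split_keep_punctuation and tokenize are made up for illustration, and the re.findall variant is an alternative technique that is not part of the answer above; it assumes, like the original pattern, that hyphens should be treated as part of a word.

```python
import re

def split_keep_punctuation(text):
    # Split on each single character that is neither a word character
    # nor a hyphen; the capturing group keeps the delimiters in the result.
    parts = re.split(r'([^-\w])', text)
    # Drop empty strings and whitespace-only pieces.
    return [part for part in parts if part.strip()]

def tokenize(text):
    # Alternative (assumption, not from the answer): collect tokens directly
    # instead of splitting. A token is either a run of word characters and
    # hyphens, or one character that is not a word character, hyphen, or space.
    return re.findall(r'[-\w]+|[^-\w\s]', text)

print(split_keep_punctuation('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
print(tokenize('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
```

The findall form avoids the empty strings altogether, so no post-processing is needed, at the cost of discarding whitespace unconditionally.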