Tags: python, python-3.x, regex, whitespace

Regex joining words split by whitespace and hyphen


My string is quite messy and looks something like this:

s="I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

I'd like to join the word fragments that are separated by a hyphen (and sometimes whitespace) and collect all the words in one list. Desired output:

expected = ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']

I tried a lot of different variations, but nothing worked:

import re

rgx = re.compile(r"([\w][\w'][\w\-]*\w)")
s = "My string'"
rgx.findall(s)


Solution

  • Here's one way:

    [re.sub(r'\s*-\s*', '', i) for i in re.split(r'(?<!-)\s(?!-)', s)]
    
    # ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own.', 'Would', 'you', 'help', 'me?']
    

    Two operations here:

    1. Split the text on whitespace that is neither preceded nor followed by a hyphen, using a negative lookbehind (?<!-) and a negative lookahead (?!-).

    2. In each resulting piece, replace every hyphen, together with any whitespace before or after it, with an empty string.

    You can see the first operation's demo here: https://regex101.com/r/ayHPvY/2

    And the second: https://regex101.com/r/ayHPvY/1
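
    To make the two operations easier to follow, here is a small sketch that runs them one at a time on the string from the question. The variable names chunks and words are only illustrative and are not part of the original answer.

```python
import re

s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

# Step 1: split on whitespace that is neither preceded nor followed by a hyphen,
# so fragments such as "can -not" and "Wo - uld" stay in one piece.
chunks = re.split(r'(?<!-)\s(?!-)', s)

# Step 2: in each piece, delete every hyphen along with any surrounding whitespace.
words = [re.sub(r'\s*-\s*', '', chunk) for chunk in chunks]

print(chunks)
print(words)
```

    Printing chunks shows which spaces survived the split, which is usually the quickest way to check whether the lookarounds behave as intended on new input.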

    Edit: To get the . and ? to be separated as well, use this instead:

    [re.sub(r'\s*-\s*','', i) for i in re.split(r"(?<!-)\s(?!-)|([^\w\s'-]+)", s) if i]
    
    # ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
    

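    The trick is to also split on runs of characters that are not word characters, whitespace, hyphens, or apostrophes, and to wrap that alternative in a capturing group so that re.split keeps the punctuation in the result. The if i filter is necessary because the split returns None for the group whenever the whitespace alternative matches, and it can also return empty strings.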
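
    The following sketch shows what re.split returns before filtering, which is why the if i check is needed, and wraps the full expression in a helper. The function name join_hyphenated and the shortened sample string are assumptions made for illustration only.

```python
import re

# Same pattern as in the edited answer: split on "free" whitespace,
# or on a captured run of punctuation so the punctuation is kept.
SPLIT_PATTERN = re.compile(r"(?<!-)\s(?!-)|([^\w\s'-]+)")

# Raw split of a short sample: the captured "." is kept, and the
# whitespace match that follows contributes an empty string and None.
raw = SPLIT_PATTERN.split("own. Wo - uld")
print(raw)


def join_hyphenated(text):
    """Hypothetical helper: split text into words and punctuation,
    joining fragments that were separated by a hyphen."""
    return [
        re.sub(r'\s*-\s*', '', piece)
        for piece in SPLIT_PATTERN.split(text)
        if piece  # drops both None and empty strings
    ]


s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
print(join_hyphenated(s))
```

    Note that this approach removes every hyphen, so genuinely hyphenated words such as "well-known" would also be merged into one token.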