I am trying to remove the spaces that occur between punctuation characters in a sentence. To illustrate, the dataset has many strings that look like this:
"This is a very nice text : ) : ) ! ! ! ."
But I want them to look like this:
"This is a very nice text :):)!!!."
I want to do this with a RegEx positive lookahead. Can someone show me how to do this in Python? The code I have now does exactly the opposite of what I want, because it adds extra spaces:
string = re.sub('([.,!?()])', r' \1', string)
In principle, you could match a space between two punctuation characters, capture both characters, and substitute only the captured characters:
string = re.sub('([:.,!?()]) ([:.,!?()])', r'\1\2', string)
However, this would result in
This is a very nice text :) :) !! !.
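since re.sub only replaces non-overlapping matches: a punctuation character consumed as the right-hand side of one match cannot also be the left-hand side of the next.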
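To make the non-overlapping behaviour visible, here is a small sketch that lists what the capturing pattern actually matches in the example string. The names PUNCT_PAIR, text, pairs and collapsed are only illustrative.

```python
import re

# Same pattern as above: punctuation, one space, punctuation.
PUNCT_PAIR = re.compile(r'([:.,!?()]) ([:.,!?()])')

text = "This is a very nice text : ) : ) ! ! ! ."

# Each match consumes both punctuation characters, so the space that
# follows a consumed character is never looked at as "between" a pair.
pairs = [m.group(0) for m in PUNCT_PAIR.finditer(text)]
print(pairs)      # [': )', ': )', '! !', '! .']

collapsed = PUNCT_PAIR.sub(r'\1\2', text)
print(collapsed)  # This is a very nice text :) :) !! !.
```

Only four of the seven spaces between punctuation characters are part of a match, which is why the other three survive.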
Hence, you need zero-width look-ahead and look-behind assertions. They check the surrounding characters without including them in the match, so the matched portion is just the space character, which we then replace with an empty string:
string = re.sub('(?<=[:.,!?()]) (?=[:.,!?()])', '', string)
with which the result is 'This is a very nice text :):)!!!.'
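For reuse, the same idea can be wrapped in a function. This is a minimal sketch: the name squeeze_punctuation is made up here, and using ' +' instead of a single space is an added assumption to also cover runs of several spaces, which the question does not explicitly ask for.

```python
import re

# The punctuation class from the answer; extend it as your data requires.
_PUNCT = r'[:.,!?()]'

# One or more spaces that sit directly between two punctuation characters.
# The look-behind and look-ahead are zero-width, so only the spaces match.
_SPACES_BETWEEN_PUNCT = re.compile(rf'(?<={_PUNCT}) +(?={_PUNCT})')

def squeeze_punctuation(text):
    """Remove spaces that occur between two punctuation characters."""
    return _SPACES_BETWEEN_PUNCT.sub('', text)

print(squeeze_punctuation("This is a very nice text : ) : ) ! ! ! ."))
# This is a very nice text :):)!!!.
print(squeeze_punctuation("Hello, world."))
# Hello, world.
```

Spaces after punctuation that are followed by an ordinary word character, as in "Hello, world.", are left alone because the look-ahead fails there.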