I'm working with long strings and I need to remove (replace with '') every run of adjacent full stops . and/or colons :, but only when the run is not adjacent to any whitespace. Examples:
a.bcd
should give abcd
a..::.:::.:bcde.....:fg
should give abcdefg
a.b.c.d.e.f.g.h
should give abcdefgh
a .b
should give a .b, because the . here is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg
should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is ''. I understand why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', so re.sub treats '.' as the usual regex wildcard (any character) rather than as a literal full stop.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh
.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
re.search only finds where in a string a pattern occurs; it does not change anything.
To replace what the pattern matches, pass the same regular expression to re.sub instead of re.search, not in addition to it.
Also, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match, not just the parenthesized group, with the replacement string.
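To make the difference concrete, here is a small sketch (the pattern b(\.)c is just an illustration I made up, it is not from the question). It also shows re.escape, which is one way to turn a literal string such as '.' into a safe pattern if you really do want to feed one call's result into another:

```python
import re

s1 = 'a.b.c.d.e.f.g.h'

# re.search only locates the first match; the string is left untouched.
m = re.search(r'b(\.)c', s1)
found_span = m.span()      # where the whole match 'b.c' sits in s1
found_group = m.group(1)   # only the parenthesized part, i.e. '.'

# re.sub replaces the ENTIRE match 'b.c', not just the group.
whole_replaced = re.sub(r'b(\.)c', '', s1)

# An unescaped '.' as a pattern means "any character", so everything goes.
wildcard_result = re.sub('.', '', s1)

# re.escape('.') yields a pattern that matches a literal full stop only.
literal_result = re.sub(re.escape('.'), '', s1)

print(found_span, repr(found_group))
print(whole_replaced)
print(repr(wildcard_result), literal_result)
```

The last line is what Attempt 1 ran into: the string returned by group(1) was used as a pattern without escaping.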
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
This is a character class containing full stop and colon, repeated as many times as possible (notice that no backslash is necessary inside the square brackets: in that context, dot and colon have no special meaning; see footnote 1). The lookarounds on both sides say that we cannot match these characters when there is whitespace \s on either side. They also exclude the characters themselves, so that the regex engine cannot find a match by backing off and letting the + match a shorter run (it will do its darndest to find a match if there is a way).
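If you want to convince yourself, a quick check against the examples from the question could look like this (the names PATTERN and examples are mine, just for the sketch):

```python
import re

# The lookaround regex from above, compiled once for reuse.
PATTERN = re.compile(r'(?<![\s.:])[.:]+(?![\s.:])')

# Input -> expected output, copied from the question.
examples = {
    'a.bcd': 'abcd',
    'a..::.:::.:bcde.....:fg': 'abcdefg',
    'a.b.c.d.e.f.g.h': 'abcdefgh',
    'a .b': 'a .b',
    'a..::.:::.:bcde.. ...:fg': 'abcde.. ...:fg',
}

for text, expected in examples.items():
    result = PATTERN.sub('', text)
    status = 'ok' if result == expected else 'MISMATCH'
    print(f'{text!r} -> {result!r} ({status})')
```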
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname@domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname@domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname@domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched", and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .*
inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
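For what it's worth, here is one possible "different way", as a rough sketch rather than a recommendation: capture the single character before the run (or the start of the string) and put it back with \g<1>, while a lookahead checks what follows without consuming it. The names ALT and strip_runs are made up for this example, and I am assuming that runs at the very start or end of the string should be removed, which is what the lookaround version does too.

```python
import re

# Group 1 is the one character before the run, or empty at the start of
# the string. The lookahead only peeks at what follows, so that character
# is still available as the "before" character of the next match.
ALT = re.compile(r'(^|[^\s.:])[.:]+(?=[^\s.:]|\Z)')

def strip_runs(text):
    """Remove runs of . and : that do not touch whitespace."""
    return ALT.sub(r'\g<1>', text)

print(strip_runs('a.b.c.d.e.f.g.h'))
print(strip_runs('a .b'))
print(strip_runs('name.surname@domain.com'))
```

The capturing group matches only one character, so there is no greedy .* to swallow the earlier full stops.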
Footnote 1: You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!