I'm trying to automatically translate some simple Perl code with a regex to Python, and I'm having an issue. Here is the Perl code:
$stamp='[stamp]';
$message = "message\n";
$message =~ s/^/$stamp/gm;
print "$message";
Output:
[stamp]message
Here is my Python equivalent:
>>> import re
>>> re.sub(re.compile("^", re.M), "[stamp]", "message\n", count=0)
'[stamp]message\n[stamp]'
Note the answer is different (it has an extra [stamp] at the end). How do I generate code that has the same behavior for the regex?
Perl's and Python's regex engines differ slightly on the definition of a "line" in multiline mode: Perl does not consider the empty string following a trailing newline at the end of the input to be a line, whereas Python does, so Python's ^ also matches at the very end of "message\n".
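To make the difference visible, here is a minimal sketch that lists the offsets where Python's multiline ^ matches in the question's input (the variable names are only illustrative):

```python
import re

text = "message\n"

# Offsets at which "^" matches when re.M (re.MULTILINE) is enabled.
positions = [m.start() for m in re.finditer(r"^", text, flags=re.M)]
print(positions)  # [0, 8]
```

Offset 8 is the empty "line" after the trailing newline, which is where the extra [stamp] comes from.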
The best solution I can come up with is to change "^" to r"^(?=.|\n)", which only matches at a line start that is followed by at least one more character (note the r prefix on the string to make it a raw literal; all regexes should use raw literals). You can also simplify a bit by calling methods on the compiled regex, or by calling re.sub with the uncompiled pattern, and since count=0 is already the default, you can omit it. Thus, the final code would be either:
re.compile(r"^(?=.|\n)", re.M).sub("[stamp]", "message\n")
or:
re.sub(r"^(?=.|\n)", "[stamp]", "message\n", flags=re.M)
Even better would be:
start_of_line = re.compile(r"^(?=.|\n)", re.M) # Done once up front
start_of_line.sub("[stamp]", "message\n") # Done on demand
This avoids recompiling the pattern (or re-checking the compiled regex cache) on every call, because the compiled regex is created just once and then reused.
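If the stamp comes from a variable, as $stamp does in the Perl code, one possible way to package this is a small helper. This is only a sketch: the names START_OF_LINE and prefix_lines are made up for illustration. It passes a callable as the replacement so that the stamp is inserted verbatim, whereas a plain replacement string would have backslash escapes and group references processed by re.

```python
import re

# Compiled once; the lookahead requires at least one character after the line start.
START_OF_LINE = re.compile(r"^(?=.|\n)", re.M)

def prefix_lines(message, stamp):
    """Prefix every line of message with stamp (hypothetical helper)."""
    # A callable replacement returns stamp as-is, so backslashes or
    # sequences like \1 inside stamp are not interpreted by re.
    return START_OF_LINE.sub(lambda match: stamp, message)

print(repr(prefix_lines("message\n", "[stamp]")))  # '[stamp]message\n'
print(repr(prefix_lines("a\nb", "[stamp]")))       # '[stamp]a\n[stamp]b'
```

One thing to keep in mind: on an empty string this pattern never matches, so the result is an empty string.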
Alternative solutions:
Split up the lines in a way that matches Perl's definition of a line, apply the non-re.MULTILINE version of the regex to each line, then join them back together, e.g.:
start_of_line = re.compile(r"^") # Compile once up front without re.M
# Split lines, keeping ends, in a way that matches Perl's definition of a line
# then substitute on line-by-line basis
''.join([start_of_line.sub("[stamp]", line) for line in "message\n".splitlines(keepends=True)])
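Strip a single trailing newline up front (if there is one), perform the substitution with the re.MULTILINE regex, then add the newline back (if applicable):
start_of_line = re.compile(r"^", re.M) # Plain "^" with re.M; compile once up front
message = '...'
if message.endswith('\n'):
    # Substitute on everything except the final newline, then restore it
    result = start_of_line.sub("[stamp]", message[:-1]) + '\n'
else:
    result = start_of_line.sub("[stamp]", message)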
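Both alternatives can be wrapped as small functions and compared on a few corner cases. This is only a sketch under some assumptions: the names stamp_by_splitting and stamp_after_strip are made up for illustration, and the first one splits on "\n" alone rather than using str.splitlines, because splitlines also breaks on "\r", form feeds and a few other characters that a Perl ^ under /m does not treat as line starts.

```python
import re

# One line including its "\n", or a final line that lacks one.
LINE = re.compile(r"[^\n]*\n|[^\n]+")
# Plain multiline "^"; safe below only because a trailing "\n" is stripped first.
START_OF_LINE = re.compile(r"^", re.M)

def stamp_by_splitting(message, stamp):
    # Prefix each "\n"-delimited piece, then glue the pieces back together.
    return "".join(stamp + line for line in LINE.findall(message))

def stamp_after_strip(message, stamp):
    # Remove one trailing newline, substitute, then restore the newline.
    # The callable replacement keeps stamp literal.
    if message.endswith("\n"):
        return START_OF_LINE.sub(lambda m: stamp, message[:-1]) + "\n"
    return START_OF_LINE.sub(lambda m: stamp, message)

for text in ["message\n", "a\n\nb", "a\rb\n", ""]:
    print(repr(text),
          repr(stamp_by_splitting(text, "[stamp]")),
          repr(stamp_after_strip(text, "[stamp]")))
```

The two functions agree on the first three inputs. On the empty string only stamp_after_strip adds a stamp. As far as I know, Perl's ^ always matches at the very start of the string, so that is the result to expect from the original s/^/$stamp/gm, but it is worth checking against real Perl if empty input matters.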
Neither option is as succinct or efficient as tweaking the regex, but if arbitrary user-supplied regexes must be handled, there will always be some corner case, and pre-processing the input into a form that removes the Perl/Python incompatibility is a lot safer.