Regex with m flag in Perl vs. Python


I'm trying to automatically translate some simple Perl code with a regex to Python, and I'm having an issue. Here is the Perl code:

$stamp='[stamp]';
$message = "message\n";
$message =~ s/^/$stamp/gm;
print "$message";  # prints: [stamp]message

Here is my Python equivalent:

>>> import re
>>> re.sub(re.compile("^", re.M), "[stamp]", "message\n", count=0)
'[stamp]message\n[stamp]'

Note the answer is different (it has an extra [stamp] at the end). How do I generate code that has the same behavior for the regex?


Solution

  • Perl and Python's regex engines differ slightly on the definition of a "line": with the m flag, Perl does not treat the empty string following a trailing newline in the input as a line, but Python does. So in Python, ^ also matches at the very end of "message\n".
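
To make the difference visible, here is a minimal sketch that lists the offsets where ^ matches in Python under re.MULTILINE. The variable names are only for illustration; the input is the same string as in the question.

```python
import re

text = "message\n"

# Offsets at which "^" matches in Python when re.MULTILINE is set.
positions = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
print(positions)  # [0, 8]
```

Offset 8 is the end of the string, right after the trailing newline. That is the match responsible for the extra [stamp] in the question's output.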

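    The best solution I can come up with is to change "^" to r"^(?=.|\n)" (note the r prefix, which makes the string a raw literal; all regexes should use raw literals). The lookahead only lets ^ match when at least one more character follows, which rules out the position after a trailing newline; the \n alternative is needed because . does not match a newline by default, so blank lines in the middle of the string would otherwise be skipped. You can also simplify a bit by either calling sub on the compiled regex or passing the uncompiled pattern to re.sub, and since count=0 is already the default, you can omit it. Thus, the final code would be either: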

    re.compile(r"^(?=.|\n)", re.M).sub("[stamp]", "message\n")
    

    or:

    re.sub(r"^(?=.|\n)", "[stamp]", "message\n", flags=re.M)
    

    Even better would be:

    start_of_line = re.compile(r"^(?=.|\n)", re.M)  # Done once up front
    
    start_of_line.sub("[stamp]", "message\n")  # Done on demand
    

    This avoids recompiling the pattern (or re-checking re's compiled-pattern cache) on every call, since the compiled regex is created just once and reused.
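
As a rough check of why the lookahead is written as (?=.|\n) rather than just (?=.), the sketch below runs both on an input with a blank line in the middle. The names with_newline_alt and dot_only are made up for this example.

```python
import re

# Pattern from the answer: "^" only where some character (including "\n") follows.
with_newline_alt = re.compile(r"^(?=.|\n)", re.MULTILINE)
# Hypothetical shorter variant, for comparison only.
dot_only = re.compile(r"^(?=.)", re.MULTILINE)

text = "a\n\nb\n"  # blank line in the middle, trailing newline at the end

stamped = with_newline_alt.sub("[stamp]", text)
stamped_dot_only = dot_only.sub("[stamp]", text)

print(repr(stamped))           # '[stamp]a\n[stamp]\n[stamp]b\n'
print(repr(stamped_dot_only))  # '[stamp]a\n\n[stamp]b\n'
```

The shorter variant silently skips the blank line, because . does not match \n unless re.DOTALL is set.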

    Alternative solutions:

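    1. Split up the lines in a way that matches Perl's definition of a line, then apply the non-re.MULTILINE version of the regex to each line, then join them back together. (One caveat: str.splitlines also splits on other boundaries such as \r and \x0b, so this only lines up with Perl when \n is the sole line terminator in the input.) E.g.: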

      start_of_line = re.compile(r"^")  # Compile once up front without re.M
      
      # Split lines, keeping ends, in a way that matches Perl's definition of a line
      # then substitute on line-by-line basis
      ''.join([start_of_line.sub("[stamp]", line) for line in "message\n".splitlines(keepends=True)])
      
    2. Strip a single trailing newline, if it exists, up-front, perform regex substitution, add back newline (if applicable):

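      start_of_line = re.compile(r"^", re.M)  # Original pattern, compiled with re.M

      message = '...'
      if message.endswith('\n'):
          # Substitute on everything but the final newline, then restore it
          result = start_of_line.sub("[stamp]", message[:-1]) + '\n'
      else:
          result = start_of_line.sub("[stamp]", message)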
      

    Neither option is as succinct or efficient as tweaking the regex, but if arbitrary user-supplied regexes must be handled, a regex tweak is always going to miss some corner case, and pre-processing the input to remove the Perl/Python incompatibility is a lot safer.
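
One way to see where the corner cases lie is to run the three approaches side by side. This is only a sketch: the function names via_lookahead, via_splitlines and via_strip are hypothetical wrappers around the snippets above, and the sample inputs are my own.

```python
import re

STAMP = "[stamp]"
lookahead = re.compile(r"^(?=.|\n)", re.MULTILINE)
caret_multiline = re.compile(r"^", re.MULTILINE)
caret_plain = re.compile(r"^")

def via_lookahead(message):
    # Tweaked regex: "^" only where at least one character follows.
    return lookahead.sub(STAMP, message)

def via_splitlines(message):
    # Alternative 1: substitute line by line, keeping line endings.
    lines = message.splitlines(keepends=True)
    return "".join(caret_plain.sub(STAMP, line) for line in lines)

def via_strip(message):
    # Alternative 2: set one trailing newline aside, substitute, put it back.
    if message.endswith("\n"):
        return caret_multiline.sub(STAMP, message[:-1]) + "\n"
    return caret_multiline.sub(STAMP, message)

samples = ["message\n", "message", "a\nb\n", "a\n\nb", "\n"]
for s in samples:
    results = {via_lookahead(s), via_splitlines(s), via_strip(s)}
    print(repr(s), "->", results)  # one element per set: all three agree

# Inputs where the approaches diverge:
print(repr(via_lookahead("")), repr(via_splitlines("")), repr(via_strip("")))
print(repr(via_lookahead("a\rb")), repr(via_splitlines("a\rb")))
```

On the five samples all three agree. They part ways on the empty string, where only via_strip inserts a stamp (which, as far as I can tell from perlre, is what Perl does too, since ^ still matches at the start of the string), and on a bare \r, which str.splitlines treats as a line break but the regex does not.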