python, python-re

Match %, avoid match of \%


I am trying to remove comments from my TeX code. I want to trim everything from % to the end of the line, but leave the escaped \% alone. I thought that this would do it:

re.sub(r"([^%]*)([^\\][%])(.*)$", r"\1", r"10 \% foo.% bar")

which outputs the almost-correct

'10 \\% foo'

Expected output:

'10 \\% foo.'

Why does it trim away the last character before %? And how can I avoid it?


Solution

  • The problem is that your regex first matches zero or more non-percent characters (group 1), and then a non-backslash character followed by a percent character (group 2).

    You replace the entire match with group 1 only, so the non-backslash character captured in group 2 is discarded together with the comment.

    Instead, use a negative lookbehind. It matches a percent character only when there is no backslash directly before it, followed by everything up to the end of the line:

    (?<!\\)%.*$
    

    In Python:

    >>> import re
    >>> re.sub(r"(?<!\\)%.*$", "", r"10 \% foo.% bar")
    '10 \\% foo.'
    

    With a multi-line string, use the re.M flag so that $ matches at the end of every line rather than only at the end of the string:

    >>> ss = r"""10 \% foo.% bar
    Hello world
    Hello world % this is a comment
    % This is also a comment
    """
    >>> print(re.sub(r"(?<!\\)%.*$", "", ss, flags=re.M))
    10 \% foo.
    Hello world
    Hello world
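
To see where the missing character goes, it can help to inspect the capture groups of the original pattern next to the lookbehind version. The sketch below does that, and then adds one variant that is not part of the answer above: in TeX, `\\%` is an escaped backslash followed by a real comment, which the plain lookbehind would leave in place. The pattern `COMMENT` and the helper `strip_tex_comments` are hypothetical names introduced here for illustration, and the sketch assumes no verbatim-like environments where % is literal.

```python
import re

line = r"10 \% foo.% bar"

# Pattern from the question: group 2 contains the character before '%',
# so replacing the whole match with group 1 drops that character.
m = re.search(r"([^%]*)([^\\][%])(.*)$", line)
print(m.groups())

# Lookbehind version: the assertion consumes no characters,
# so the '.' before '%' is not part of the match.
fixed = re.sub(r"(?<!\\)%.*$", "", line)
print(fixed)

# Assumed extension (not in the answer above): allow an even number of
# backslashes before '%', capture them, and put them back.
COMMENT = re.compile(r"(?<!\\)((?:\\\\)*)%.*$", re.M)

def strip_tex_comments(text):
    """Remove TeX comments, keeping \\% and any escaped backslashes."""
    return COMMENT.sub(r"\1", text)

print(strip_tex_comments(r"a \\% comment after a line break"))
```

The first `print` shows the `.` sitting inside group 2, which is why it disappears. The last call keeps the two backslashes and removes only the comment.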