Match %, avoid match of \%

I am trying to remove comments from my TeX code. I want to trim everything after a %, but leave the escaped \% alone. I thought that this would do it:

re.sub(r"([^%]*)([^\\][%])(.*)$", r"\1", "10 \% foo.% bar")

which outputs the almost-right result

'10 \\% foo'

expected output:

'10 \\% foo.'

Why does it trim away the last character before %? And, how can I avoid it?

Solution

The problem is that your regex first matches zero or more non-percent characters (group 1), and then a non-backslash character followed by a percent sign (group 2).

You replace the entire match with group 1 only, so the non-backslash character captured in group 2 is dropped along with the %.
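To see this for yourself, you can print the groups of the original regex. This is just a quick check with re.search on the same pattern and input (written as a raw string so the backslash is taken literally):

```python
import re

# The original pattern from the question, unchanged.
m = re.search(r"([^%]*)([^\\][%])(.*)$", r"10 \% foo.% bar")

# Group 2 holds the '.' together with the '%', so replacing the whole
# match with group 1 alone throws the '.' away.
print(m.groups())  # (' foo', '.%', ' bar')
```

The match only starts after the escaped \%, which is why that part of the string survives untouched.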

Instead, use a negative lookbehind. It matches a percent sign only when there is no backslash immediately before it, without consuming the preceding character, and then everything up to the end of the line:

(?<!\\)%.*$

In Python:

>>> re.sub(r"(?<!\\)%.*$", "", "10 \% foo.% bar")
'10 \\% foo.'
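If it helps to see what each piece of the pattern does, here is the same regex spelled out with re.VERBOSE. The names COMMENT and strip_comment are only illustrative, not anything from the re module. The test input is a raw string, because "\%" in an ordinary string literal is an invalid escape that newer Python versions warn about.

```python
import re

# Same pattern as above, spelled out piece by piece.
COMMENT = re.compile(
    r"""
    (?<!\\)   # the next character must not be preceded by a backslash
    %         # an unescaped percent sign starts the comment
    .*$       # everything up to the end of the line
    """,
    re.VERBOSE,
)


def strip_comment(line):
    """Return the line with a trailing TeX comment removed (illustrative helper)."""
    return COMMENT.sub("", line)


print(strip_comment(r"10 \% foo.% bar"))  # 10 \% foo.
```

Compiling the pattern once is only a convenience if you call it on many lines; the behaviour is the same as the one-line re.sub call.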

With a multi-line string, use the re.M flag:

>>> ss = """10 \% foo.% bar
Hello world
Hello world % this is a comment
% This is also a comment
"""
>>> print(re.sub(r"(?<!\\)%.*$", "", ss, flags=re.M))
10 \% foo.
Hello world
Hello world
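One case the lookbehind does not cover: in TeX, \\ is itself a command (a line break), so in "a \\% note" the percent sign is not escaped and starts a real comment. The lookbehind still sees a backslash before it and leaves the comment in place. If that matters for your files, one possible variation is sketched below. It is my own assumption rather than part of the pattern above, and it does not try to handle verbatim environments or other special TeX contexts. The idea is to capture any pairs of backslashes in front of the % and put them back in the replacement. COMMENT_EVEN and strip_comments are made-up names.

```python
import re

# Assumption: an even run of backslashes before % means the % is NOT escaped.
# Group 1 captures those backslash pairs so the replacement can restore them.
COMMENT_EVEN = re.compile(r"(?<!\\)((?:\\\\)*)%.*$", re.M)


def strip_comments(text):
    """Strip TeX comments line by line, keeping escaped \\% (illustrative helper)."""
    return COMMENT_EVEN.sub(r"\1", text)


sample = "\n".join([
    r"10 \% foo.% bar",                # escaped %, then a real comment
    r"line break \\% a real comment",  # two backslashes: % is unescaped
    r"escaped both \\\% kept",         # three backslashes: % is escaped
])
print(strip_comments(sample))
```

With this version the first line becomes "10 \% foo.", the second keeps its \\ but loses the comment, and the third is left unchanged.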