python regex metacharacters

Escape all metacharacters in Python


I need to search for patterns that may contain many metacharacters. Currently I use a long regex:

prodObjMatcher=re.compile(r"""^(?P<nodeName>[\w\/\:\[\]\<\>\@\$]+)""", re.S|re.M|re.I|re.X)

(My actual pattern is very long, so I pasted only the relevant portion I need help with.)

This is especially painful when I need to combine several such patterns in a single re.compile call.

Is there a Pythonic way to shorten the pattern?


Solution

  • Your pattern can be reduced to

    r"""^(?P<nodeName>[]\w/:[<>@$]+)"""
    
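    As a quick sanity check, the sketch below compares the fully escaped character class from the question with the reduced one over all ASCII characters. The sample string and the variable names are made up for illustration.

```python
import re

# Character class from the question, with every symbol escaped.
escaped = re.compile(r"[\w\/\:\[\]\<\>\@\$]")
# Reduced class: ] comes first, so only \w still needs a backslash.
reduced = re.compile(r"[]\w/:[<>@$]")

# Both classes should accept exactly the same ASCII characters.
ascii_chars = [chr(i) for i in range(128)]
same = all(
    bool(escaped.fullmatch(c)) == bool(reduced.fullmatch(c))
    for c in ascii_chars
)
print(same)

# The reduced pattern with the flags used in the question.
matcher = re.compile(r"""^(?P<nodeName>[]\w/:[<>@$]+)""", re.S | re.M | re.I | re.X)
node = matcher.match("top/u1:sig[3]<a>@$ rest").group("nodeName")
print(node)
```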

    Inside a character class, you never have to escape a non-word character except ^, -, ], and \ (the backslash also introduces shorthand classes such as \w). Even those, apart from \, can stay unescaped if you place them carefully:

    • ] goes at the start of the character class
    • - goes at the start or end of the character class
    • ^ goes anywhere except the start of the character class (it needs escaping only when it is the first character and is meant literally).
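
    The placement rules can be checked with a few small classes. This is only an illustrative sketch; the test strings are arbitrary.

```python
import re

# ] placed first in the class is a literal closing bracket.
bracket_first = re.findall(r"[]a]", "a]b")

# - placed first or last is a literal hyphen, not a range operator.
hyphen_first = re.findall(r"[-a]", "a-b")
hyphen_last = re.findall(r"[a-]", "a-b")

# ^ placed anywhere but first is a literal caret.
caret_later = re.findall(r"[a^]", "a^b")

# ^ placed first negates the class instead.
caret_first = re.findall(r"[^a]", "a^b")

print(bracket_first, hyphen_first, hyphen_last, caret_later, caret_first)
```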

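    Outside a character class, you must escape \, [, (, ), +, $, ^, *, ?, ., and |. A { needs escaping only where it could be read as the start of a quantifier such as {2,3}. Because your pattern is compiled with re.X, literal whitespace and # outside a character class must be escaped too.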
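
    To illustrate escaping outside a character class, here is a minimal sketch with a made-up sample string. It also uses re.escape from the standard library, which this answer does not otherwise rely on, as one way to build such a pattern from a literal string instead of escaping by hand.

```python
import re

text = "price is $5.00 (approx.)"

# Escaped by hand: $, ., ( and ) are metacharacters outside a class.
manual = re.compile(r"\$5\.00 \(approx\.\)")

# re.escape produces an equivalent pattern from the literal text.
literal = "$5.00 (approx.)"
auto = re.compile(re.escape(literal))

manual_hit = manual.search(text).group()
auto_hit = auto.search(text).group()
print(manual_hit == auto_hit == literal)

# Left unescaped, the dot matches any character, so "5x00" also matches.
loose_hit = re.search(r"5.00", "5x00") is not None
strict_hit = re.search(r"5\.00", "5x00") is not None
print(loose_hit, strict_hit)
```

    re.escape is only suitable for the parts of a pattern that are meant literally; anything that should keep its regex meaning has to be written outside the escaped portion.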

    Note that / is not a special regex metacharacter in Python regex patterns, and does not have to be escaped.

    Use raw string literals when defining your regex patterns to avoid issues such as confusing the word boundary r'\b' with the backspace character '\b'.
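
    The difference between the two spellings of \b can be seen in a short sketch (the sample text is arbitrary):

```python
import re

text = "a cat sat"

# Raw string: \b reaches the regex engine as a word boundary.
raw_hits = re.findall(r"\bcat\b", text)

# Non-raw string: Python turns "\b" into a backspace character (0x08)
# before the regex engine sees it, so the pattern looks for backspaces.
plain_hits = re.findall("\bcat\b", text)

print(raw_hits)    # ['cat']
print(plain_hits)  # []

# The raw form is two characters (backslash, b); the plain form is one.
print(len(r"\b"), len("\b"))  # 2 1
```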