
Regular expression to replace "escaped" characters with their originals


NOTE: I'm not parsing lots of HTML or generic HTML with regex. I know that's bad.

TL;DR:

I have strings like

A sentence with an exclamation\! Next is a \* character

where some characters are "escaped" in the original markup. I wish to replace them with their "originals" and get:

A sentence with an exclamation! Next is a * character

I have a small bit of data that I need to extract from some wiki markup.

I'm only dealing with paragraphs/snippets here, so I don't need a big robust solution. In Python, I tried a test:

s = "test \\* \\! test * !! **"

r = re.compile("""\\.""") # Slash followed by anything

r.sub("-", s)

This SHOULD yield:

test - - test * !! **

But it doesn't do anything. Am I missing something here?

Furthermore, I'm not sure how to go about replacing any given escaped character with its original, so I would probably just make a list and sub with specific regexes like:

\\\*

and

\\!

There's probably a much cleaner way to do this, so any help is greatly appreciated.


Solution

  • You are missing something, namely the r prefix:

    r = re.compile(r"\\.") # Slash followed by anything
    

    Both Python and re attach meaning to \; your doubled backslash becomes a single backslash by the time the string value reaches re.compile(), so re sees \., meaning a literal full stop:

    >>> print("""\\.""")
    \.
    

    By using r'' you tell Python not to interpret escape codes, so now re is given a string containing \\., meaning a literal backslash followed by any character:

    >>> print(r"""\\.""")
    \\.
    

    Demo:

    >>> import re
    >>> s = "test \\* \\! test * !! **"
    >>> r = re.compile(r"\\.") # Slash followed by anything
    >>> r.sub("-", s)
    'test - - test * !! **'
    

    The rule of thumb is: when defining regular expressions, use r'' raw string literals. That saves you from having to double-escape everything that has meaning to both Python and regular expression syntax.

    Next, you want to replace the match with the 'escaped' character itself. Use a capturing group for that; re.sub() lets you reference groups in the replacement value:

    r = re.compile(r"\\(.)") # Note the parentheses, that's a capturing group
    r.sub(r'\1', s)          # \1 means: replace with value of first capturing group
    

    Now the output is:

    >>> r = re.compile(r"\\(.)") # Note the parentheses, that's a capturing group
    >>> r.sub(r'\1', s)
    'test * ! test * !! **'
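
Putting the two steps together, here is a minimal sketch of a reusable helper, assuming Python 3. The names unescape, ESCAPED_CHAR and non_raw are hypothetical and only for illustration; the pattern is the one from the answer above.

```python
import re

# Hypothetical names for illustration; the pattern is the one from the answer.
# A backslash followed by any single character, which is captured as group 1.
ESCAPED_CHAR = re.compile(r"\\(.)")

def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return ESCAPED_CHAR.sub(r"\1", text)

# For contrast: without the r prefix, "\\." reaches re as \. and
# matches only a literal full stop, never a backslash.
non_raw = re.compile("\\.")
print(non_raw.sub("-", "a.b \\* c"))
# a-b \* c

print(unescape("A sentence with an exclamation\\! Next is a \\* character"))
# A sentence with an exclamation! Next is a * character
```

Because the regex scans left to right, an escaped backslash (two backslashes in the markup) collapses to a single backslash rather than being treated as the start of another escape. Note that . does not match a newline by default, so a backslash at the very end of a line is left alone unless you pass re.DOTALL.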