python regex lookbehind

Positive lookbehind with a matching group to be extracted


testString = ("<h2>Tricks</h2>"
              "<a href=\"#\"><i class=\"icon-envelope\"></i></a>")
import re
re.sub("(?<=[<h2>(.+?)</h2>\s+])<a href=\"#\"><i class=\"icon-(.+?)\"></i></a>", "{{ \\1 @ \\2 }}", testString)

This produces: invalid group reference.

Changing the replacement to use only \\1 extracts just envelope, which makes me think the lookbehind is ignored. Is there a way to extract something from a lookbehind?

I would like to produce:

<h2>Tricks</h2>
{{ Tricks @ envelope }}

Solution

  • It looks like you really want to use an HTML parser instead. Mixing regular expressions and HTML gets painful very quickly.

    In your regular expression, you created a character class (a set of characters that is allowed to match) consisting of <, h, 2, >, etc. here:

    [<h2>(.+?)</h2>\s+]
    

    which could have been written as:

    [<>h2()+.?/\s]
    

    and it would match the same characters.

    Don't use [..] unless you want to create a set of characters for a match (\s, \d, etc. are pre-built character classes).
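To make the character-class point concrete, here is a small sketch that compiles the bracketed text from the question on its own and compares it with the shorter spelling. The variable names are chosen for illustration only.

```python
import re

# The bracketed text from the question, compiled on its own.
question_class = re.compile(r"[<h2>(.+?)</h2>\s+]")
# The shorter spelling of the same set of characters.
short_class = re.compile(r"[<>h2()+.?/\s]")

# Inside [...], parentheses are literal characters, so no group is created.
print(question_class.groups)  # 0

# Each class matches exactly one character, and both accept the same ones.
sample = "<h2>Tricks</h2> (.+?)\n"
agree = all(
    bool(question_class.fullmatch(ch)) == bool(short_class.fullmatch(ch))
    for ch in sample
)
print(agree)  # True

# The full pattern from the question therefore has only one group
# (the icon name), which is why \2 is an invalid group reference.
full_pattern = re.compile(
    r'(?<=[<h2>(.+?)</h2>\s+])<a href="#"><i class="icon-(.+?)"></i></a>'
)
print(full_pattern.groups)  # 1
```

Because the class has a fixed width of one character, the lookbehind compiles, but it only checks that the single character before `<a` is one of those in the set.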

    However, even if you removed the brackets, the lookbehind would not be allowed: Python's re module does not accept variable-width patterns in a lookbehind (no + or *). So with the character class the lookbehind no longer matches what you think it matches, and without it the lookbehind is not permissible.
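The following sketch shows the failure and one possible regex-only workaround: match the heading with an ordinary capturing group instead of a lookbehind, and write it back in the replacement. It assumes the markup is exactly as regular as in the question, and it uses `\s*` rather than `\s+` because the sample string has no whitespace between the two tags.

```python
import re

test_string = ('<h2>Tricks</h2>'
               '<a href="#"><i class="icon-envelope"></i></a>')

# Without the brackets the lookbehind contains .+? and \s+, whose width
# varies, so the re module refuses to compile the pattern.
try:
    re.compile(r'(?<=<h2>(.+?)</h2>\s+)<a href="#">')
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Workaround: consume the heading with a normal group (\1) and put it back.
# \2 is the heading text, \3 is the icon name.
pattern = re.compile(
    r'(<h2>(.+?)</h2>)\s*'
    r'<a href="#"><i class="icon-(.+?)"></i></a>'
)
result = pattern.sub(r'\1\n{{ \2 @ \3 }}', test_string)
print(result)
# <h2>Tricks</h2>
# {{ Tricks @ envelope }}
```

This still breaks as soon as the HTML varies (extra attributes, different quoting, nested tags), which is the reason for preferring a parser.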

    All in all, just use BeautifulSoup instead.
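A parser-based approach might look like the sketch below. BeautifulSoup is a third-party package, so this version swaps in the standard library's `html.parser` to stay self-contained. The class name `HeadingIconParser` and its pairing logic are assumptions made for illustration, not part of the question.

```python
from html.parser import HTMLParser


class HeadingIconParser(HTMLParser):
    """Hypothetical helper: pairs the latest <h2> text with each icon-* class."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False    # True while inside an <h2> element
        self.heading = None   # text of the most recent <h2>
        self.pairs = []       # collected (heading, icon name) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.heading = ""
        elif tag == "i" and self.heading is not None:
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name.startswith("icon-"):
                    self.pairs.append((self.heading, name[len("icon-"):]))

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.heading += data


test_string = ('<h2>Tricks</h2>'
               '<a href="#"><i class="icon-envelope"></i></a>')

parser = HeadingIconParser()
parser.feed(test_string)
parser.close()

# Rebuild the desired output from the extracted pairs.
lines = ["<h2>%s</h2>\n{{ %s @ %s }}" % (h, h, icon) for h, icon in parser.pairs]
output = "\n".join(lines)
print(output)
# <h2>Tricks</h2>
# {{ Tricks @ envelope }}
```

With BeautifulSoup the same idea would be expressed by finding each `h2` element and the icon element that follows it, without writing the event handlers by hand.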