
Capture a Repeating Group in Python using RegEx (see example)


I am writing a regular expression in Python to capture the contents of an SSI tag.

I want to parse the tag:

<!--#include file="/var/www/localhost/index.html" set="one" -->

into the following components:

  • Tag Function (ex: include, echo or set)
  • Name of attribute, found before the = sign
  • Value of attribute, found between the double quotes

The problem is that I am at a loss as to how to capture these repeating groups, since name/value pairs may occur one or more times in a tag. I have spent hours on this.

Here is my current regex string:

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

It captures the include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)


I am using this site to develop my regex


Solution

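  • Python's re module keeps only the last match of a repeated capturing group, so capture everything that can repeat as a single group, then parse the pairs individually. This is probably a good use case for named groups, as well!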

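    import re

    data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
    # Groups 1-3: tag function, first attribute name, first attribute value.
    # Group 4: any further name="value" pairs, captured together as one string
    # (empty when the tag has only one attribute).
    pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)"((?: [a-z]+=".*?")*) -->'''

    result = re.match(pat, data)
    print(result.groups())
    # ('include', 'file', '/var/www/localhost/index.html', ' set="one" reset="two"')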
    

    Then iterate through it:

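    g1, g2, g3, g4 = result.groups()
    for keyvalue in g4.split():  # split on whitespace (assumes values contain no spaces)
        key, value = keyvalue.split('=', 1)  # split on the first '=' only
        value = value.strip('"')  # drop the surrounding double quotes
        # do something with them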
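
The same two-step idea can also be sketched with named groups and re.finditer, which copes with quoted values that contain spaces or '=' signs. This is only one possible approach, not necessarily what the answer above had in mind. The names TAG_RE, ATTR_RE and parse_ssi_tag are hypothetical, and the sketch assumes lowercase attribute names and double-quoted values with no escaped quotes inside them.

```python
import re

# Hypothetical helper names, chosen for illustration only.
# Step 1: match the whole tag, capturing the function and the raw attribute text.
TAG_RE = re.compile(r'^<!--#(?P<function>[a-z]+)\s+(?P<attrs>.*?)\s*-->$')
# Step 2: match one name="value" pair at a time inside the attribute text.
ATTR_RE = re.compile(r'(?P<name>[a-z]+)="(?P<value>[^"]*)"')


def parse_ssi_tag(tag):
    """Return (function, [(name, value), ...]), or None if the tag does not match."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    pairs = [
        (attr.group('name'), attr.group('value'))
        for attr in ATTR_RE.finditer(match.group('attrs'))
    ]
    return match.group('function'), pairs


data = '<!--#include file="/var/www/localhost/index.html" set="one" -->'
function, pairs = parse_ssi_tag(data)
print(function)  # include
print(pairs)     # [('file', '/var/www/localhost/index.html'), ('set', 'one')]
```

Because each pair is matched separately, the number of attributes does not have to be known in advance, and the quotes never end up in the captured values.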