Search code examples
regexxmlopenxml

Runaway Backtracking when trying to parse XML


I have a problem trying to parse some OpenXML Standard (docx). We use expressions like {Contact.MailAddress} and fill this in from data in a second step. However, the way Word (and LibreOffice) are, is that they sometimes split up these tags like this:

<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>{</w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>Contakt.MailAddress</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>}</w:t>
</w:r>

So I made the following regex:

(?<expr>{)((?<tag><[^>]+>)|(?<expr>[\w\s.]+))+(?<expr>})

Everything in the expr groups is part of the {Contact.MailAddress} expression and gets merged together. Everything in the tag groups gets merged into tags to fix the xml together later.

Now, this works pretty well. However, when we use our {foreach} syntax, the xml might get quite big and then we have a runaway case.

Can anyone think of a regex, that captures this better that doesn't result in a runaway?

EDIT 1: The program is written in C#/.NET. For the regex flavor.

EDIT 2: I took another approach: I listed all the matches where there is a match like {[^}]} and within there I replace all tags and whitespaces with nothing:

var matches = Regex.Matches(xml, @"{[^}]+}")
    .Cast<Match>()
    .OrderByDescending(x => x.Index)
    .ToList();

foreach (var match in matches)
{
    var replacement = Regex.Replace(match.Value, @"<[^>]+>", "");
    replacement = Regex.Replace(replacement, @"\s+", "");
    xml = xml.Substring(0, match.Index) + replacement + xml.Substring(match.Index + match.Length);
}

The trick is to order the matches by index decending so the math in Substring works.


Solution

  • It looks like you want to remove all tags and whitespace between { and }. If you're not worried about other braces that shouldn't be matched, this should work:

    s = Regex.Replace(s, 
        @"(?<brace>{)\s*(?:<[^<>]+>\s*)*|\s*(?:<[^<>]+>\s*)*(?<brace>})", 
        @"${brace}");
    

    For safety's sake, you might want to add the nearest actual tags (assuming they're always the same):

    @"(?<brace>{)</w:t>\s*(?:<[^<>]+>\s*)*|\s*(?:<[^<>]+>\s*)*<w:t>(?<brace>})"
    

    With either regex, I get this result:

    <w:r w:rsidRPr="00E22BCD">
        <w:rPr>
            <w:rFonts w:eastAsia="Times New Roman"/>
            <w:lang w:val="fr-CH"/>
        </w:rPr>
        <w:t>{Contakt.MailAddress}</w:t>
    </w:r>
    

    ...and there's no backtracking at all.

    EDIT:

    Turns out the tags are also being inserted before and after the dot inside the braces. My original solution doesn't work for that, so here's a two-stage approach that finds the brace-enclosed text and replaces it with the same text with tags and whitespace removed:

    s = Regex.Replace(s, 
        @"{[^{}]*}", 
        m => Regex.Replace(m.Value, @"\s*(?:<[^<>]+>\s*)*", ""));