html xml regex sgml

Regex for an open SGML node that contains <p>, </p>, and <br /> tags


I have some SGML that I'm trying to clean up by adding closing tags to the opening ones. Right now, the document has a structure like this:

<CAT>
<NAME>Daniel
<COLOR>White
<DESC>Daniel is a white cat <p>He was born in July</p><br />He's super cute.<p><br />He does not have any siblings.
<COUNTRY>USA
</CAT>

So far I can match an open tag and capture its content as a group using this regexp: <NAME>([^\\<]+)[^<], as long as the content doesn't contain any <p>, </p>, or <br /> elements.

But if I do <DESC>([^\\<]+)[^<], the match stops right before the first <p>.

I'm using < as the end of the pattern because none of the other open nodes contain HTML elements that would stop the match early.

How can I make a regexp that matches the <DESC> node that includes <p>, </p>, <br /> and ends before the <COUNTRY> node?


Solution

  • How about this:

    <DESC>((?:</?p>|<br />|[^\\<])+)
    

    The alternation lets <p>, </p>, and <br /> match as whole tags inside the group, and the match stops at the next < that doesn't start one of those three.

    By the way, why aren't you allowing the backslash as a valid character?
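
To see the difference between the two patterns, here is a minimal sketch in Python. The question doesn't say which regex engine is in use, so Python's re module is an assumption, as are the names SAMPLE, ORIGINAL_PATTERN, DESC_PATTERN, and extract_desc. Both patterns are copied verbatim from above and written as raw strings, so \\ inside the character class means a literal backslash.

```python
import re

# Sample document copied from the question.
SAMPLE = """<CAT>
<NAME>Daniel
<COLOR>White
<DESC>Daniel is a white cat <p>He was born in July</p><br />He's super cute.<p><br />He does not have any siblings.
<COUNTRY>USA
</CAT>
"""

# Pattern from the question: stops at the first "<", whatever tag it starts.
ORIGINAL_PATTERN = re.compile(r"<DESC>([^\\<]+)[^<]")

# Pattern from the answer: <p>, </p>, and <br /> are consumed as whole tags;
# any other "<" ends the match.
DESC_PATTERN = re.compile(r"<DESC>((?:</?p>|<br />|[^\\<])+)")


def extract_desc(sgml):
    """Return the content of the open <DESC> node, or None if there is none."""
    match = DESC_PATTERN.search(sgml)
    if match is None:
        return None
    # The group also swallows the newline before <COUNTRY>, so trim it.
    return match.group(1).rstrip()


old_match = ORIGINAL_PATTERN.search(SAMPLE).group(1)
desc = extract_desc(SAMPLE)

print(repr(old_match))
print(repr(desc))
```

The first pattern captures only the text before the first <p>, while the second captures the whole description up to, but not including, <COUNTRY>. The trailing newline is part of the captured group because the negated character class also matches line breaks, which is why the sketch strips it.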