
RegEx that removes XHTML line breaks appearing before block-level tags


I need a RegEx that finds extraneous <br /> tags that occur before block tags, leaving all other <br /> tags intact.

Here's the text I am searching:

<div>some text<br id="first"/>some more text<br id="second"/></div>

However, when using the following RegEx:

</? *br.*?>(?=</? *([^(br)]).*?)

It selects everything past the first <br /> tag like so:

<br id="first"/>some more text<br id="second"/>

... Which isn't what I want. How can I modify the expression so it only selects <br id="second"/>?

Notes: All inline tags except <br /> tags are stripped out before this point, so they won't be a factor. Also, I am using Obj-C/Cocoa, so I can't use all those fancy PHP functions. :) The input will always be a valid XHTML doc.


Solution

  • <br[^<>]*>(?=\s*<(?!br))
    

    should do what you want.

    Explanation of the regex:

    <br     # Match <br
    [^<>]*  # followed by any number of non-bracket characters
    >       # and a >.
    (?=     # Assert that we are right before...
     \s*    # optional whitespace,
     <      # followed by any tag
     (?!br) # except br
    )       # (End of lookahead)
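
To check the pattern against the sample text from the question, here is a minimal sketch in Python. Python is used only for illustration, and the variable names are made up for this example. NSRegularExpression is built on ICU regular expressions, which also support lookahead assertions, so the same pattern string should carry over to Cocoa, but that is an assumption and is not tested here.

```python
import re

# Pattern from the answer: a <br ...> tag followed (after optional
# whitespace) by any tag that is not another <br>.
BR_BEFORE_TAG = re.compile(r'<br[^<>]*>(?=\s*<(?!br))')

# Sample input from the question.
html = '<div>some text<br id="first"/>some more text<br id="second"/></div>'

# The lookahead is not part of the match, so only the <br> itself is returned.
matches = BR_BEFORE_TAG.findall(html)

# Removing the matches strips the extraneous line break and nothing else.
cleaned = BR_BEFORE_TAG.sub('', html)

print(matches)  # ['<br id="second"/>']
print(cleaned)  # <div>some text<br id="first"/>some more text</div>
```

Because the lookahead consumes no characters, the tag after the <br> stays in place when the match is replaced with an empty string.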
    

    Some comments:

    • I've removed the optional slashes from your regex because </br> doesn't exist in HTML or XHTML.
    • I've also removed the optional spaces at the start of the tags because whitespace is not allowed between < and the tag name (nor between / and >).
    • As an aside: XHTML normally writes br as a self-closing tag (<br /> or <br/>). Attributes such as id are allowed on it, so <br id="foo" /> is valid.
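
To see why the pattern from the question over-matches, the sketch below runs both patterns on the same input. It is written in Python purely for illustration, and the variable names are hypothetical. It suggests two causes: `.*?` is allowed to run past the first `>` whenever the lookahead fails, and `[^(br)]` is a character class rather than "anything except br".

```python
import re

html = '<div>some text<br id="first"/>some more text<br id="second"/></div>'

# Pattern from the question. ".*?" keeps expanding until the lookahead
# succeeds, and "[^(br)]" matches any single character other than
# "(", "b", "r" or ")".
original = re.compile(r'</? *br.*?>(?=</? *([^(br)]).*?)')

# Pattern from the answer. "[^<>]*" cannot leave the current tag.
fixed = re.compile(r'<br[^<>]*>(?=\s*<(?!br))')

original_match = original.search(html).group(0)
fixed_match = fixed.search(html).group(0)

print(original_match)  # <br id="first"/>some more text<br id="second"/>
print(fixed_match)     # <br id="second"/>
```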