
How can I capture and format on output nested format tags?


I'm working on a forum system that parses BBCode like [b]some bold text[/b] and applies HTML formatting to it when output via PHP. All of my expressions work, but I'm having trouble figuring out how to deal with a certain scenario, specifically regarding nested quote blocks.

On a forum you might have one user quote another user. I have been successful at formatting this using:

#\[quote="(.*?);(\w*?)"\]\s*(.*?)\s*\[\/quote\]#

and calling preg_replace() to replace it with:

<blockquote id="quote-$2"><p>$3<br> - $1</p></blockquote>

Here is a working example.

For a real example you might see on a forum, a user, Stan, wants to quote John, adding this to a textarea for submission:

[quote="John;2"]John's sentence[/quote] 
____________

Stan's reply

But what happens if John had quoted Mary in his post?

[quote="John;2"][quote="Mary;1"]Mary's sentence[/quote]John's sentence[/quote]
____________

Stan's reply

My regex captures everything except the last [/quote], but even if I were able to capture the whole string, I'm not sure how I would format it. Ideally, I'd like the output to look something like this:

    "Mary's sentence"          
        - Mary

"John's sentence"
    - John
__________________________

Stan's reply

In HTML:

<blockquote id="quote-2">
    <blockquote id="quote-1"><p>"Mary's sentence"<br> - Mary</p></blockquote>
    <p>"John's sentence"<br> - John</p>
</blockquote>
<p>Stan's reply</p>
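
For comparison, this is roughly what the original pattern produces on the nested input. It is only a minimal sketch: the variable names are chosen for illustration, and the pattern and replacement are the ones quoted earlier in this question.

```php
<?php
// Pattern and replacement from the question (not aware of nesting).
$pattern     = '#\[quote="(.*?);(\w*?)"\]\s*(.*?)\s*\[\/quote\]#';
$replacement = '<blockquote id="quote-$2"><p>$3<br> - $1</p></blockquote>';

// John quotes Mary, and Stan quotes John.
$nested = '[quote="John;2"][quote="Mary;1"]Mary\'s sentence[/quote]John\'s sentence[/quote]';

// The lazy (.*?) stops at the FIRST [/quote], so John's opening tag is
// paired with Mary's closing tag and the last [/quote] is left over.
$result = preg_replace($pattern, $replacement, $nested);
echo $result, PHP_EOL;
// <blockquote id="quote-2"><p>[quote="Mary;1"]Mary's sentence<br> - John</p></blockquote>John's sentence[/quote]
```

Mary's opening tag ends up inside John's blockquote as plain text, and the final [/quote] stays unconverted.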

Can I capture and format repeated nested tags using regex? What if there are 100 nested quote blocks? Obviously I can just write a ridiculously long and repetitive expression (which certainly would have limitations), but there has to be a better way to tackle this. Is there another method I should use?

I'm sorry if a similar question already exists, but I have looked through many questions on SO and am still not sure which approach I should take.


Solution

  • The idea is to match only the innermost BB tag: match the text between [quote= and [/quote] that does not contain another [quote=, replace it, and repeat until no such match is found. This assumes that [quote= never appears literally inside the quoted content, which holds in most cases. It also assumes that the attribute value is enclosed in double quotes and contains no other double quotes.

    So, you may use

    $s = '[quote="John;2"][quote="Mary;1"]Mary\'s sentence[/quote]John\'s sentence[/quote]';
    $repl = '<blockquote id="quote-$2"><p>$3 <br> - $1</p></blockquote>';
    $reg = '~\[quote="([^"]*);(\w*)"]\s*((?:(?!\[quote=).)*?)\s*\[/quote]~si';
    while (preg_match($reg, $s)) {
        $s = preg_replace($reg, $repl, $s);
    }
    echo $s;
    // => <blockquote id="quote-2"><p><blockquote id="quote-1"><p>Mary's sentence <br> - Mary</p></blockquote>John's sentence <br> - John</p></blockquote>
    

    See the PHP demo. The regex is

    '~\[quote="([^"]*);(\w*)"]\s*((?:(?!\[quote=).)*?)\s*\[/quote]~si'
    

    See the regex demo.

    Details

    • \[quote=" - a literal substring
    • ([^"]*) - Capturing group 1: any 0+ chars other than "
    • ; - a semicolon
    • (\w*) - Capturing group 2: 0+ word chars
    • "] - a literal substring
    • \s* - 0+ whitespaces
    • ((?:(?!\[quote=).)*?) - Capturing group 3: any char (including line breaks, because of the s flag), 0 or more times but as few as possible, where no char starts a [quote= substring
    • \s* - 0+ whitespaces
    • \[/quote] - a literal [/quote] substring.
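
The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, assuming the x (extended) flag is acceptable in your code base; the helper name replaceInnermost and the variable names are hypothetical and only serve the illustration.

```php
<?php
// Same pattern as above, rewritten with the x (extended) flag so that
// whitespace is ignored and a hash sign starts a comment inside the regex.
$verbose = <<<'REGEX'
~
\[quote="                # literal opening of the tag
([^"]*)                  # group 1: author name
;                        # semicolon separator
(\w*)                    # group 2: post id
"]                       # end of the opening tag
\s*
((?:(?!\[quote=).)*?)    # group 3: content with no nested opening tag
\s*
\[/quote]                # closing tag
~six
REGEX;

$compact = '~\[quote="([^"]*);(\w*)"]\s*((?:(?!\[quote=).)*?)\s*\[/quote]~si';
$repl    = '<blockquote id="quote-$2"><p>$3 <br> - $1</p></blockquote>';

// Hypothetical helper: repeat the replacement until nothing matches.
function replaceInnermost(string $reg, string $repl, string $s): string
{
    while (preg_match($reg, $s)) {
        $s = preg_replace($reg, $repl, $s);
    }
    return $s;
}

$input = '[quote="John;2"][quote="Mary;1"]Mary\'s sentence[/quote]John\'s sentence[/quote]';

// Both spellings of the pattern should give the same HTML.
var_dump(replaceInnermost($verbose, $repl, $input) === replaceInnermost($compact, $repl, $input));
// bool(true)
```

Note that comments inside the extended pattern must not contain the delimiter character (~ here), because PHP looks for the closing delimiter before the regex engine sees the comments.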

    Pretty-printing is a separate task; there are a couple of solutions mentioned here.
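
As a side note on the replacement loop shown earlier: it scans the string twice per pass (once in preg_match() and once in preg_replace()). A possible variant, sketched here under the assumption that a fixed upper bound on nesting depth is acceptable, uses the optional count argument of preg_replace() instead. The function name renderQuotes and the $maxPasses parameter are hypothetical.

```php
<?php
// Hypothetical wrapper around the innermost-first replacement.
// $maxPasses is an assumed safety cap on nesting depth, not part of the answer.
function renderQuotes(string $text, int $maxPasses = 100): string
{
    $reg  = '~\[quote="([^"]*);(\w*)"]\s*((?:(?!\[quote=).)*?)\s*\[/quote]~si';
    $repl = '<blockquote id="quote-$2"><p>$3 <br> - $1</p></blockquote>';

    for ($pass = 0; $pass < $maxPasses; $pass++) {
        // The fifth argument receives the number of replacements made.
        $next = preg_replace($reg, $repl, $text, -1, $count);
        if ($next === null) {
            break; // PCRE error: keep the text converted so far
        }
        $text = $next;
        if ($count === 0) {
            break; // no innermost quote left to convert
        }
    }
    return $text;
}

echo renderQuotes('[quote="John;2"][quote="Mary;1"]Mary\'s sentence[/quote]John\'s sentence[/quote]'), PHP_EOL;
// <blockquote id="quote-2"><p><blockquote id="quote-1"><p>Mary's sentence <br> - Mary</p></blockquote>John's sentence <br> - John</p></blockquote>
```

Each pass converts every quote block that currently has no nested opening tag, so the number of passes grows with the nesting depth rather than with the total number of quotes. Input with an unmatched opening or closing tag is left as it is.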