How to parse nested BB Code with parameters


I'd like to write a BB Code filter for a PHP website. (I'm using CakePHP, so it would be a BB Code helper.) I have the following requirements:

  • BB Code can be nested, so something like this is valid:

    [block]  
        [block]  
        [/block]  
        [block]  
            [block]  
            [/block]  
        [/block]  
    [/block]  
    
  • BB Code tags can have zero or more parameters.

    Example:

    [video: url="url", width="500", height="500"]Title[/video]
    
  • BB Code might have multiple behaviours

    For example, [url]text[/url] would be transformed to [url:url="text"]text[/url], and the video BB Code would be able to choose between YouTube, Dailymotion, etc.

I've already tried this with regex, and my biggest problem was matching parameters. I got nested BB Code and BB Code with no parameters to work, but when I added a regex match for parameters, it no longer matched nested BB Code correctly:

"\[($tag)(=.*)\"\](.*)\[\/\1\]" (I used the non-greedy .*? rather than .*)

I don't have the complete regex with me right now, but it looked something like the one above.

Is there a way to match BB Code with regex or something else?

The only thing I can think of is to use the visitor pattern and to split my text on each possible tag. That way I would have more control over the parsing, and I could validate the document: if the input text doesn't contain valid BB Code, I could notify the user with an error before saving anything.

I would use SableCC to create my text parser.
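
A minimal sketch of the split-and-validate idea, hand-written in PHP rather than generated with SableCC. The function name validateBbcode, the allowed-tag list, and the assumption that parameter values never contain [ or ] are all illustrative:

```php
<?php
// Hypothetical helper: splits the text on tags and checks nesting with a stack.
// Returns null when the BB Code is well formed, or an error message otherwise.
// Assumption: parameter values never contain '[' or ']'.
function validateBbcode(string $text, array $allowedTags): ?string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the matched tags between the text chunks.
    $tokens = preg_split(
        '/(\[\/?[a-z]+[^\[\]]*\])/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $stack = [];
    foreach ($tokens as $token) {
        if (!preg_match('/^\[(\/?)([a-z]+)[^\[\]]*\]$/i', $token, $m)) {
            continue; // plain text between tags
        }
        $name = strtolower($m[2]);
        if (!in_array($name, $allowedTags, true)) {
            continue; // unknown tag: treat it as literal text
        }
        if ($m[1] === '') {
            $stack[] = $name; // opening tag
        } else {
            $open = array_pop($stack); // null when nothing is open
            if ($open !== $name) {
                return $open === null
                    ? "Unexpected [/{$name}]"
                    : "Expected [/{$open}] but found [/{$name}]";
            }
        }
    }

    return $stack ? 'Unclosed [' . end($stack) . ']' : null;
}
```

Calling validateBbcode($input, ['block', 'video', 'url']) before saving would give an error message to show the user, or null when every opened tag is closed in the right order.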


Solution

  • There are both PECL and PEAR BBCode parsing libraries. Software's hard enough without reinventing years of work on your own.

    If neither of those is an option, I'd concentrate on turning the BBCode into a valid XML string and then running your favorite XML parser on it. This is a very rough idea, but:

    1. Run the code through htmlspecialchars to escape any entities that need escaping

    2. Transform all [ and ] characters into < and > respectively

    3. Don't forget to account for the colon in cases like [tagname:

    If the BBCode is nested properly, you should be all set to pass this string into an XML parsing object (SimpleXML, DOMDocument, etc.).
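
    The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name bbcodeToXml and the <root> wrapper are made up for the example, every bracketed word is treated as a tag, parameters are written as name="value", and quotes are left unescaped in step 1 so that the parameters survive. Besides the colon, the commas between parameters are dropped too, since XML attributes are separated by whitespace.

```php
<?php
// Hypothetical helper: converts BB Code to XML and parses it with SimpleXML.
function bbcodeToXml(string $bbcode): SimpleXMLElement
{
    // Step 1: escape &, < and >, but keep quotes because the parameters need them.
    $escaped = htmlspecialchars($bbcode, ENT_NOQUOTES, 'UTF-8');

    // Steps 2 and 3: rewrite each [tag ...] as <tag ...>, dropping the colon
    // after the tag name and the commas between parameters.
    $xml = preg_replace_callback(
        '/\[(\/?)([a-z]+):?([^\[\]]*)\]/i',
        function (array $m): string {
            preg_match_all('/([a-z]+)="([^"]*)"/i', $m[3], $params, PREG_SET_ORDER);
            $attrs = '';
            foreach ($params as $p) {
                $attrs .= ' ' . $p[1] . '="' . $p[2] . '"';
            }
            return '<' . $m[1] . $m[2] . $attrs . '>';
        },
        $escaped
    );

    // XML needs a single root element; the constructor throws an Exception
    // when the tags are not nested properly.
    return new SimpleXMLElement('<root>' . $xml . '</root>');
}

$doc = bbcodeToXml('[block][video: url="http://example.com/?a=1&b=2", width="500"]Title[/video][/block]');
echo $doc->block->video['width'], "\n"; // prints 500
echo $doc->block->video, "\n";          // prints Title
```

    Because badly nested input makes the SimpleXMLElement constructor throw, wrapping the call in try/catch is one way to reject invalid BBCode before it is saved.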