php, html, parsing, smart-tags

What are smart tags and how do I remove them from html?


So I am still working on this parser. Today I found a document containing the tag <st1:place w:st="on">. Google tells me it is a Microsoft Office Smart Tag.

I would like to get rid of these things, but I cannot find a list of what they are or how many of them there are.

If they all follow the <...:...> pattern it would be easy to remove with regex.

The document has no doctype and a .jsp extension, but all the content is between two <html> tags, and however non-standard the beast is, I still need to parse it.

OK, it is actually not a big issue, but it throws off my formatting and bugs me.


Solution

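  • This regexp should do the trick:

    /<\/?[[:alnum:]]+:[^>]*>/
    

    It matches any tag that opens with < (or </ for a closing tag), followed by an alphanumeric prefix and a ':' colon, and runs up to the next >. The POSIX class has to sit inside a bracket expression, hence [[:alnum:]], and [^>]* stops the match at the end of the tag instead of running on to the last > in the document.

    Alternatively:

    /<\s*\/?[[:alnum:]]+:[^>]*>/
    

    This would allow looser formatting of the tag (whitespace between the opening < and the namespace prefix).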
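For what it is worth, here is a minimal sketch of how such a pattern might be applied in PHP with preg_replace. The helper name stripPrefixedTags and the sample markup are made up for illustration, and it assumes no attribute value contains a literal > character. Only the tags themselves are removed; the text between them is kept.

```php
<?php
// Hypothetical helper (not part of the original answer): removes
// namespace-prefixed tags such as <st1:place w:st="on"> and </st1:place>,
// leaving the text between them in place.
function stripPrefixedTags(string $html): string
{
    // "<", optional whitespace, optional "/", an alphanumeric prefix,
    // a colon, then everything up to the next ">".
    $result = preg_replace('/<\s*\/?[[:alnum:]]+:[^>]*>/', '', $html);

    // preg_replace returns null on error; fall back to the untouched input.
    return $result ?? $html;
}

// Made-up sample input for illustration.
$sample = '<p>Offices in <st1:place w:st="on">London</st1:place> and <o:p></o:p>Paris.</p>';
echo stripPrefixedTags($sample), PHP_EOL;
// prints: <p>Offices in London and Paris.</p>
```

Ordinary tags such as <p> or <a href="http://..."> are left alone, because the colon has to come directly after the tag's alphanumeric prefix.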