php html parsing sax

Parsing of badly formatted HTML in PHP


In my code I convert a styled XLS document to HTML using OpenOffice. I then parse the tables using xml_parser_create. The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't emit a doctype and doesn't quote attributes (<TABLE WIDTH=4>).

The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.

Do you know a PHP parser (hopefully a built-in one) that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix 'broken' HTML?


Solution

  • One way to "fix" broken HTML could be to use HTMLPurifier (quoting):

    HTML Purifier is a standards-compliant HTML filter library written in PHP.
    HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant


  • An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting):

    The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

    And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
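
    As a rough illustration of the DOMDocument approach, here is a minimal sketch. The sample markup is made up to imitate what the question describes (uppercase tags, an unquoted attribute, an unclosed <BR>, no doctype), and the variable names are only illustrative. It assumes the DOM extension is enabled, which is the default in most PHP builds. libxml_use_internal_errors is used so that parser warnings about the sloppy markup are collected rather than printed.

```php
<?php
// Hypothetical input imitating the OpenOffice output described in the question:
// uppercase tags, unquoted attribute, unclosed <BR>, no doctype.
$html = '<TABLE WIDTH=4><TR><TD>a<BR>b</TD><TD>c</TD></TR>'
      . '<TR><TD>d</TD><TD>e</TD></TR></TABLE>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Walk the table: one array per row, one string per cell.
// The HTML parser lowercases element names, so 'tr' and 'td' match <TR> and <TD>.
$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

    Note that getElementsByTagName returns all descendants, so with nested tables the inner rows and cells would be picked up as well; for that case an XPath query via DOMXPath is probably a better fit.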