How to prevent the PHP DOMDocument from "fixing" your HTML string


I have been trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.

However, I have run into a problem. For testing purposes I wrote a small HTML page containing the following incorrect HTML:

<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>

As you can see, the title is outside the head tag, which is the error I am trying to detect.

Here is the problem: when I fetch this page with cURL and pass the response string to DOMDocument to load it as HTML, the parser "fixes" the markup by adding another pair of <head> and </head> tags around the title:

<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>

I have checked the cURL response data, and it is not the source of the problem: the string still contains the broken markup. It is PHP's DOMDocument that repairs the HTML during the call to loadHTML().

I have also tried turning off the DOMDocument recover, substituteEntities and validateOnParse properties by setting them to false, without success.

I have searched Google but have not found an answer so far. I guess it is rare for someone to actually want broken HTML left unfixed.

Does anyone know how to prevent DOMDocument from fixing my broken HTML?
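
For reference, that attempt might look roughly like the sketch below. It is a reconstruction, not the asker's actual code: the variable names and the libxml_use_internal_errors() call are assumptions, and the HTML is inlined instead of fetched with cURL. It prints where the parser put the title, which can differ between libxml versions.

```php
<?php
// Broken test markup from the question: <title> sits outside <head>.
$html = <<<'HTML'
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

// Assumption: collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->recover = false;
$dom->substituteEntities = false;
$dom->validateOnParse = false;
$dom->loadHTML($html);

// Find the title and report which element the parser attached it to.
$title = $dom->getElementsByTagName('title')->item(0);
$titleParent = ($title !== null && $title->parentNode !== null)
    ? $title->parentNode->nodeName
    : '(no title element found)';

echo 'title parent: ', $titleParent, PHP_EOL;

libxml_clear_errors();
```

The three properties are set before loadHTML() is called, yet the document still loads and the title ends up inside a parsed tree, which matches the behavior described above.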


Solution

  • UPDATE: as of PHP 5.4 (with libxml 2.7.7 or later) you can pass the LIBXML_HTML_NOIMPLIED option, which sets libxml's HTML_PARSE_NOIMPLIED flag:

    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
    

    Original answer below

    You can't. In theory, libxml has a flag for this, HTML_PARSE_NOIMPLIED, which prevents the parser from adding implied markup, but it's not accessible from PHP.

    On a side note, this particular behavior seems to depend on the LIBXML_VERSION in use.

    Running this snippet:

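    <?php
    // Broken test markup: <title> sits outside <head>.
    $html = <<<HTML
    <head>
    <meta name="description" content="randomdesciption">
    </head>
    <title>sometitle</title>
    HTML;
    
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $dom->formatOutput = true;
    // Print the serialized document followed by the libxml version number.
    echo $dom->saveHTML(), LIBXML_VERSION;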
    

    gives the following on my machine:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head><meta name="description" content="randomdesciption"></head>
    <title>sometitle</title>
    </html>
    20707
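
Putting the update and the version note together, a minimal sketch for checking what your own build does could look like this. The defined() guard, the variable names, and the use of libxml_use_internal_errors() are assumptions added for illustration. Note that LIBXML_HTML_NOIMPLIED only stops the parser from adding implied html/head/body elements; the parser may still rearrange other nodes, so inspect the output on your libxml version before relying on it for detection.

```php
<?php
// Same broken markup as in the question.
$html = <<<'HTML'
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

// Assumption: collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

// LIBXML_VERSION is an integer (20707 means 2.7.7);
// LIBXML_DOTTED_VERSION is the same version as a string.
echo 'libxml version: ', LIBXML_DOTTED_VERSION, ' (', LIBXML_VERSION, ')', PHP_EOL;

// Only pass the flag when this PHP build defines it.
$options = 0;
if (defined('LIBXML_HTML_NOIMPLIED')) {
    $options |= LIBXML_HTML_NOIMPLIED;
}

$dom = new DOMDocument();
$dom->loadHTML($html, $options);

// Serialize the parsed tree so it can be compared with the input.
$output = $dom->saveHTML();
echo $output;

libxml_clear_errors();
```

Comparing the printed markup with the input shows whether the title was wrapped, moved, or left alone on that particular libxml version.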