HTML Purifier: over-purification of third-party source


UPDATE 2: The author has already identified the problem: http://htmlpurifier.org/phorum/read.php?3,5088,5113

UPDATE: The issue appears to be exclusive to version 4.2.0. I have downgraded to 4.1.0 and it works. Thank you for all your help. The author of the package has been notified.

I am scraping some pages like:

http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215

According to W3C validation it is valid XHTML Strict.

I then use HTML Purifier (http://htmlpurifier.org/) to purify the HTML before loading it into a DOMDocument. However, it returns only a single line of content, the page title.

Output:

12:15 Catterick Bridge - Tuesday 1st January 2008 - Timeform | Betfair

Code:

echo $content; # all good
$purifier = new \HTMLPurifier();
$content = $purifier->purify($content);
echo $content; # all bad

BTW, it works for data sourced from another site. As you say, it leaves just the title for all pages from this domain.

Solution

  • You should not need HTML Purifier just to load the page. DOMDocument::loadHTML() tolerates malformed markup and builds a DOM tree from it anyway. However, it emits warnings on invalid HTML, so suppress them like this:

    $doc = new DOMDocument();
    @$doc->loadHTML($content);
    

    The warnings are then suppressed, and you can do what you wish with the parsed HTML.

    If you are scraping links, I would recommend SimpleXMLElement::xpath(), which is much easier than walking the DOMDocument by hand. Note that SimpleXMLElement needs well-formed XML, which valid XHTML should satisfy. An example:

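    $xml = new SimpleXMLElement($content); // throws if $content is not well-formed XML
    // '//a' matches anchors at any depth; a bare 'a' only matches direct children of the root.
    // Note: elements in a default namespace (such as XHTML's) need a prefix registered
    // via registerXPathNamespace() before they match.
    $result = $xml->xpath('//a/@href');
    
    print_r($result);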
    

    You can write much more complex XPath expressions that let you specify class names, ids, and other attributes. This is much more powerful than walking the DOMDocument tree node by node.
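
The two ideas above (loading tolerant HTML without warnings, then querying by attribute) can be combined in a minimal sketch. It assumes a DOMDocument plus DOMXPath rather than SimpleXMLElement, since DOMXPath accepts the same XPath expressions and works on markup that is not well-formed XML. The sample markup, the class name "race", and the variable names are hypothetical stand-ins for the scraped page. libxml_use_internal_errors() is used here as an alternative to the @ operator.

```php
<?php
// Hypothetical sample markup standing in for the scraped $content.
$content = '<html><head><title>Demo</title></head><body>'
    . '<a class="race" href="/race/1215">12:15</a>'
    . '<a href="/about">About</a>'
    . '<p>Unclosed paragraph'
    . '</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Every link in the document, at any depth.
$allLinks = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $allLinks[] = $attr->nodeValue;
}

// Only links whose class attribute contains the token "race" (hypothetical class).
$raceLinks = [];
$expr = '//a[contains(concat(" ", normalize-space(@class), " "), " race ")]/@href';
foreach ($xpath->query($expr) as $attr) {
    $raceLinks[] = $attr->nodeValue;
}

print_r($allLinks);
print_r($raceLinks);
```

The concat/normalize-space idiom matches "race" as a whole class token, so a class such as "racecard" would not match. If you parse the page as XML instead, remember that XHTML elements live in a default namespace and generally need a registered prefix before an expression like this will find them.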