
Why is this XPath query not working on the DOM of Facebook application pages?


I don't understand why my XPath query returns the correct href for the second URL but not for the first. The HTML of both pages appears to have the same structure, yet no href is returned for the first one. (I comment out one of the $url lines at a time to test each case.)

$url = "http://apps.facebook.com/TexasHoldEmPoker/"; // this one does not work
//$url = "http://nu.nl"; // this one works

$response = wp_remote_get($url);
$data = $response['body'];
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->strictErrorChecking = false;
$href='';
if (!$dom->loadHTML($data))
{
    foreach (libxml_get_errors() as $error)
    {
        // Parse errors are ignored here; inspect $error->message when debugging.
    }
    libxml_clear_errors();
}
else
{
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query("/html/head/link[@rel='shortcut icon']");

    if ($elements !== false) // query() returns false on failure, never null
    {
        foreach ($elements as $element)
        {
            if ($element->getAttribute('href'))
            {
                $href = $element->getAttribute('href');
            }
        }
    }
}
echo $href;

So I know the code works correctly for "nu.nl" but not for the Facebook app pages. I can't see why, since the structure looks the same.

p.s. : full code here: http://plugins.svn.wordpress.org/wp-favicons/trunk/plugins/sources/page.php


Solution

  • Take a look at the output of $dom->saveXML().

    You'll see that in the parsed document the <link> element is a child of <body>, not of <head> as expected.

    So the xpath should be:

    /html/body/link[@rel='shortcut icon']
    

    or

    //link[@rel='shortcut icon']
    

    I guess the different markup is the result of the parser trying to fix the illegal <noscript> inside the <head>: everything in the head from this <noscript> onward has been moved into the <body>.
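
As a minimal sketch of the second suggestion, the lookup could be written so that it does not depend on where the parser places the <link> element. The function name find_favicon_href and the sample markup below are hypothetical and only illustrate the idea; they are not taken from the plugin or from Facebook's actual page. Unlike the code in the question, this version returns the first non-empty href rather than the last one.

```php
<?php
// Hypothetical helper (not part of the original plugin): returns the href of
// the first <link rel="shortcut icon"> found anywhere in the document.
function find_favicon_href(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    // "//link" matches the element whether it ended up in <head> or <body>.
    $elements = $xpath->query("//link[@rel='shortcut icon']");
    if ($elements === false) {
        return '';
    }

    foreach ($elements as $element) {
        $href = $element->getAttribute('href');
        if ($href !== '') {
            return $href;
        }
    }
    return '';
}

// Made-up sample markup with a <noscript> inside <head>, similar in shape to
// the situation described in the answer.
$sample = '<html><head><title>t</title>'
    . '<noscript><meta http-equiv="refresh" content="0"></noscript>'
    . '<link rel="shortcut icon" href="/favicon.ico">'
    . '</head><body><p>x</p></body></html>';

echo find_favicon_href($sample), "\n";
```

To check where the parser actually put the element for a given page, $dom->saveXML() can be printed as suggested above; the "//link" form avoids having to know the answer in advance.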