An empty attribute in DOM returns an unexpected fallback value


I have retrieved the content of this webpage http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369 and saved it into $webpage.

PLEASE NOTE:

In this webpage, there are a number of <meta> tags. One of those meta-tags is the culprit and is causing some problems. This meta-tag is <meta property="og:description" content="" />. Note that the value of content is an empty string.

I am reading the content of webpage as follows:

<?php

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';

$webpage = file_get_contents($url);

$og_entry_title = "";
$og_entry_content = "";

$doc = new DOMDocument;
$doc->loadHTML($webpage);

$meta_tags = $doc->getElementsByTagName('meta');

foreach ($meta_tags as $meta_tag) {

    if ($meta_tag->getAttribute('property') == 'og:title') {
        $og_entry_title = $meta_tag->getAttribute('content');
    }

    if ($meta_tag->getAttribute('property') == 'og:description') {
        $og_entry_content = $meta_tag->getAttribute('content');
    }

}

// print the results
echo
'$og_entry_title: ' . $og_entry_title
.PHP_EOL.
'$og_entry_content: ' . $og_entry_content;

When I finish, I have the following values for $og_entry_title and $og_entry_content:

$og_entry_title: TOP STORIES | DW.COM
$og_entry_content: News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment.

Please note the following in the result:

$og_entry_title is correct and contains the page title, so no problem here

$og_entry_content gives a different value from what I was expecting. I would expect an empty string to be saved in $og_entry_content; however the string "News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment." is saved. This string appears to be a fallback value (or default value) that is returned whenever a metatag contains an empty string.

After further investigation, it turned out that og:description is getting its value from the http://www.dw.com home page. It seems that, because my webpage contained an empty string, the returned value was retrieved from the root page of the site instead.

I have the following questions about $og_entry_content:

  1. How do I ensure that the empty string (not the fallback value) is saved into $og_entry_content?

  2. Why is this fallback value from the root page being returned anyway?

Thanks.


Solution

  • Answer

    Your web address has special characters in it that need to be URL encoded.


    Explanation

    First of all, the assumption that...

    $og_entry_title is correct and contains the page title, so no problem here

    ...is wrong.

    This title:

    <meta property="og:title" content="تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006" />
    

    is not the same as this title:

    <meta property="og:title" content="TOP STORIES | DW.COM" />
    

    Secondly, most modern browsers are awesome enough to do URL encoding on the fly and still display the special characters in the address bar.

    You can see the response headers from the web server for more information.

    <?php
    $url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $response = curl_exec($ch);
    
    // Then, after your curl_exec call:
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    echo '
    header
    ------
    '.substr($response, 0, $header_size);
    

    The results show that the server doesn't recognize the unencoded URL as belonging to that page:

    header
    ------
    HTTP/1.1 301 Moved Permanently
    Server: Apache-Coyote/1.1
    Location: /
    Content-Length: 0
    Accept-Ranges: bytes
    X-Varnish: 99639238
    Date: Thu, 16 Jun 2016 15:42:51 GMT
    Connection: keep-alive
    

    HTTP response code 301 is a notice to (permanently) redirect to another page. Location: / indicates that you should just go to the home page. Sending a request to the home page when the server doesn't know what to do with it is a common sloppy practice.

    Curl won't follow redirects by default, which is how we're able to examine the 301 response header. But file_get_contents will follow redirects, which is why you're getting different content than you expect. (There may be exceptions: a bug report notes that it doesn't always follow redirects.)
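
One way to confirm the redirect without switching to cURL is to ask the HTTP stream wrapper not to follow it, then look at the response headers. This is a minimal sketch, not the answerer's code: `find_redirect()` is a hypothetical helper name, and the commented usage assumes the `follow_location` HTTP context option and the `$http_response_header` variable that the HTTP wrapper fills in after a request.

```php
<?php
/**
 * Inspect raw response header lines and report a redirect, if any.
 * Returns ['status' => int, 'location' => ?string] when the last
 * status line is a 3xx response, otherwise null.
 */
function find_redirect(array $headers): ?array
{
    $status = null;
    $location = null;

    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A new status line starts a new response; forget the old Location.
            $status = (int) $m[1];
            $location = null;
        } elseif (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }

    if ($status !== null && $status >= 300 && $status < 400) {
        return ['status' => $status, 'location' => $location];
    }

    return null;
}

// Hypothetical usage (needs network access, so it is left commented out):
// $context = stream_context_create(['http' => ['follow_location' => 0]]);
// $body = file_get_contents($url, false, $context);
// var_dump(find_redirect($http_response_header));
```

If the helper returns a non-null result for your URL, the content you would otherwise parse belongs to the redirect target, not to the page you asked for.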

    Note that the home page does have content in its og:description:

    <?php
    echo file_get_contents('http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369');
    

    Results in this output:

    ...

    <meta property="og:description" content="News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment. " />
    

    ...

    <meta property="og:title" content="TOP STORIES | DW.COM" />
    

    ...


    Solution

    First thing you need to do is rawurlencode the web address:

    $url = rawurlencode($url);
    

    Then realize that rawurlencode is poorly named, because a valid URL contains a scheme such as http:// or https:// and can also contain slashes that delimit its parts. This is a problem because rawurlencode converts colons : to %3A and slashes / to %2F, which produces an invalid URL like http%3A%2F%2Fwww.dw.com%2Far%2F.... It should have been named rawurlencode_parts_of_URL, but they didn't ask me :) And to quote Phil Karlton in their defense:

    There are only two hard things in Computer Science: cache invalidation and naming things.

    So convert the slashes and colons back to their original form:

    $url = str_replace('%3A',':',str_replace('%2F','/',$url));
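
The two str_replace calls work for this URL, but a URL with a query string would need more of them, since rawurlencode also encodes ?, = and &. As an alternative sketch (an assumption on my part, not part of the original answer), you could percent-encode only the bytes that are not printable ASCII and leave every delimiter alone. `encode_non_ascii()` is a hypothetical helper name.

```php
<?php
/**
 * Percent-encode every byte outside printable ASCII (0x21-0x7E).
 * Delimiters such as : / ? = & and existing %XX escapes are left untouched.
 */
function encode_non_ascii(string $url): string
{
    // No /u modifier: the pattern matches single bytes, so each byte of a
    // multi-byte UTF-8 character becomes its own %XX escape.
    $encoded = preg_replace_callback(
        '/[^\x21-\x7E]/',
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $encoded ?? $url;
}

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
echo encode_non_ascii($url), PHP_EOL;
```

For a URL that has only a scheme, host and path, this should give the same string as rawurlencode followed by the two replacements above. It does not validate the URL or encode reserved ASCII characters that appear inside a path segment.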
    

    Finally, the last thing you need to do is send a header to your clients to let them know what character encoding to expect.

    header("content-type: text/html; charset=utf-8");
    

    Otherwise, your clients might be reading some gobbledygook that could look something like this:

    تقرير استخباري اميركي: القاعدة تسيطر على غرب العراÙ


    Final Product

    <?php
    
    // let's see error output on screen while in development
    // remove these lines for production, and use log files only
    error_reporting(-1);
    ini_set('display_errors', 'On');
    
    $url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
    
    // URL encode special chars
    $url = rawurlencode($url);
    
    // fix colons and slashes for valid URL
    $url = str_replace('%3A',':',str_replace('%2F','/',$url));
    
    // make request
    $webpage = file_get_contents($url);
    
    $og_entry_title = "";
    $og_entry_content = "";
    
    $doc = new DOMDocument;
    $doc->loadHTML($webpage);
    
    $meta_tags = $doc->getElementsByTagName('meta');
    
    foreach ($meta_tags as $meta_tag) {
    
        if ($meta_tag->getAttribute('property') == 'og:title') {
            $og_entry_title = $meta_tag->getAttribute('content');
        }
    
        if ($meta_tag->getAttribute('property') == 'og:description') {
            $og_entry_content = $meta_tag->getAttribute('content');
        }
    
    }
    
    // set the character set for the client
    header("content-type: text/html; charset=utf-8");
    
    // print the results
    echo
    '$og_entry_title: ' . $og_entry_title
    .PHP_EOL.
    '$og_entry_content: ' . $og_entry_content;
    

    Results in this output:

    $og_entry_title: تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006
    $og_entry_content:
    

    Addendum

    If you're looking at your error logs, and you really should always be looking at your error logs when developing, then you'll notice a litany of warnings:

    Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 4 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 5 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 6 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 7 in ...
    
    Warning: DOMDocument::loadHTML(): ID topMetaInner already defined in Entity, line: 300 in ...
    
    Warning: DOMDocument::loadHTML(): ID langSelectTrigger already defined in Entity, line: 315 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...
    
    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...
    

    This is because you're feeding the DOMDocument class invalid HTML rather than a well-formed document. But this is a topic for a different question.
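
If the warnings are just noise for your purposes, libxml can collect them instead of emitting them. This is a minimal sketch, assuming you only need the Open Graph tags: `extract_og_tags()` is a hypothetical helper, and the HTML string is made-up sample markup rather than the real page. It also shows that an empty content attribute does come back as an empty string.

```php
<?php
/**
 * Parse HTML while collecting libxml parse errors instead of raising warnings.
 * Returns the og:* meta tags found and the collected LibXMLError objects.
 */
function extract_og_tags(string $html): array
{
    // Switch to internal error handling and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $tags[$property] = $meta->getAttribute('content');
        }
    }

    return ['tags' => $tags, 'errors' => $errors];
}

// Made-up sample markup with an empty og:description and a stray ampersand.
$html = '<html><head>'
    . '<meta property="og:title" content="Example" />'
    . '<meta property="og:description" content="" />'
    . '</head><body><p>a & b</p></body></html>';

$result = extract_og_tags($html);
var_dump($result['tags']);
```

Whether and how many errors are collected depends on the libxml version, so log `$result['errors']` while developing rather than relying on a particular count.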