
Meta description returned in the wrong language using PHP


I wonder if someone could shed some light on an issue I am experiencing. I am building an SEO tool that looks at a website's title and description meta tags. What I have found is that when using

<?php

$tags = get_meta_tags("https://twitter.com");
echo $tags['description'];
?>

I am getting the description returned in German

"Verbinde Dich sofort mit den Dingen, die für Dich am wichtigsten sind. Folge Freunden, Experten, Lieblingsstars und aktuellen Nachrichten"

and not in English

"Instantly connect to what's most important to you. Follow your friends, experts, favorite celebrities, and breaking news."

I have the same issue with Bing.com. I also tried this with cURL and got the same result.

This is what my cURL code looked like:

<?php

$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank. 

function file_get_contents_curl($url)
{
$ch = curl_init();

curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

$data = curl_exec($ch);
curl_close($ch);

return $data;
}

$html = file_get_contents_curl("https://twitter.com");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
    $description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
    $keywords = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'language')
    $language = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";

?>

The cURL version is running here: http://www.chillwebdesigns.co.uk/tools/4/test.php

Has anyone come across this before?


Solution

  • The HTTP request sent by get_meta_tags does not contain the traditional Accept-Language header that normal web browsers send in order to notify the server which language might be appropriate.

    It seems like some sites (e.g. Twitter) will use a geographical IP lookup to determine the content language:

    From my local computer in Sweden:

    Koppla direkt upp dig mot det som är viktigast för dig. Följ dina vänner, experter, favoritkändisar, och nyheter.

    From my VPS in London, UK:

    Instantly connect to what's most important to you. Follow your friends, experts, favourite celebrities, and breaking news.

    So it seems that if you only want to look at the English metadata, your script needs to act like an English-localised web browser, using Accept-Language and possibly other means as well.

    EDIT: Here is an example of how to extract the meta tags by first fetching the HTML using cURL, with the cURL headers set to include Accept-Language.

    Code example:

    <?php
    function file_get_contents_curl($url)
    {
    $ch = curl_init();
    
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
    $header[] = "Cache-Control: max-age=0"; 
    $header[] = "Connection: keep-alive"; 
    $header[] = "Keep-Alive: 300"; 
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
    $header[] = "Accept-Language: en-us,en;q=0.5";
    
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    
    $data = curl_exec($ch);
    curl_close($ch);
    
    return $data;
    }
    
    $html = file_get_contents_curl("http://twitter.com");
    
    //parsing begins here:
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    
    //get and display what you need:
    $title = $nodes->item(0)->nodeValue;
    
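    $metas = $doc->getElementsByTagName('meta');
    // Default to empty strings so the echo statements below never hit undefined variables.
    $description = $keywords = $language = '';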
    
    for ($i = 0; $i < $metas->length; $i++)
    {
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
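    // Match on the name attribute and read the value from content;
    // the stray semicolon after the if made the assignment unconditional.
    if($meta->getAttribute('name') == 'language')
        $language = $meta->getAttribute('content');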
    }
    
    echo "Title: $title". '<br/><br/>';
    echo "Description: $description". '<br/><br/>';
    echo "Keywords: $keywords";
    
    ?>
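
If cURL is not available, the same idea (send Accept-Language, then parse the returned HTML) can be sketched with a stream context instead. This is only a rough sketch under stated assumptions: the helper names english_http_context() and extract_meta_tags() are made up for illustration, and the header values, user agent string and timeout are arbitrary. As noted above, a site that picks the language by IP address may still ignore the header.

```php
<?php
// Hypothetical helper: builds a stream context that asks the server for English.
// The header values, user agent and timeout are illustrative assumptions.
function english_http_context()
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'header'          => "Accept-Language: en-US,en;q=0.9\r\n",
            'user_agent'      => 'Mozilla/5.0 (compatible; ExampleSeoTool/1.0)',
            'follow_location' => 1,
            'timeout'         => 10,
        ],
    ]);
}

// Hypothetical helper: returns the <title> text and the content of every
// <meta name="..."> tag, keyed by the lowercased name.
function extract_meta_tags(string $html): array
{
    $result = ['title' => null, 'meta' => []];
    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== '') { // skips tags such as <meta property="og:...">
            $result['meta'][$name] = $meta->getAttribute('content');
        }
    }

    return $result;
}
```

To use it, fetch the page with file_get_contents($url, false, english_http_context()), check that the result is not false, and pass it to extract_meta_tags(). Reading $info['meta']['description'] ?? '' then avoids the undefined-variable notices that the loop above can produce when a page has no description tag.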