Extracting the title and abstract from a webpage


I am trying to extract the title and abstract from arXiv pages, for example http://arxiv.org/abs/1207.0102. My code currently looks like this:

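function get_title($url){
  $str = file_get_contents($url);
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    // non-greedy match so only the first <title> element is captured; ignore case
    if(preg_match("/<title>(.*?)<\/title>/i",$str,$title)){
      return $title[1];
    }
  }
  return null; // request failed or no <title> found
}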

echo get_title("http://arxiv.org/abs/1207.0102");

When I run this code, I get the following warning:

Warning: file_get_contents(http://arxiv.org/abs/1207.0102): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\wamp\www\mysite\Index.php

This problem doesn't happen when I try other URLs, for example http://www.washingtontimes.com/.

Does anyone know why this happens?

Also, is it possible to extract the abstract from this webpage?


Solution

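  • The website rejects requests that do not send a User-Agent header, and file_get_contents does not send one unless the user_agent ini setting or a stream context provides it. This is the response the site returns: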

    HTTP/1.1 403 Forbidden
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head><title>403 Forbidden</title></head>
    <body>
    <h1>Access Denied</h1>
    
     <p>Sadly, your client does not supply a proper User-Agent,
     and is consequently excluded.</p>
     <p>We have an inordinate number of problems with automated scripts
     which do not supply a User-Agent, and violate the automated access
     guidelines posted at arxiv.org
     -- hence we now exclude them all.</p>
     <p>(In rare cases, we have found that accesses through proxy servers
     strip the User-Agent information. If this is the case, you need to contact
     the administrator of your proxy server to get it fixed.)</p>
    
    
    <p>If you believe this determination to be in error, see
    <b>http://arxiv.org/denied.html</b> for additional information.</p>
    </body>
    </html>
    

    If you send a user agent with your request, for example "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko", it will work:

    $options = array(
      'http'=>array(
        'method'=>"GET",
        'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
      )
    );
    $context = stream_context_create($options);
    $str = file_get_contents($url, false, $context);