Extracting the title and abstract from a webpage

I am trying to extract the title and abstract from arXiv pages, for example http://arxiv.org/abs/1207.0102. My code currently looks like this:

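function get_title($url){
  $str = file_get_contents($url); // returns false if the request fails
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // collapse whitespace so line breaks inside <title> are handled
    // non-greedy, case-insensitive match of the first <title> element
    if(preg_match('/<title>(.*?)<\/title>/i', $str, $title)){
      return $title[1];
    }
  }
  return null; // download failed or no <title> found
}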

echo get_title("http://arxiv.org/abs/1207.0102");

When I run this code, I get the following warning:

Warning: file_get_contents(http://arxiv.org/abs/1207.0102): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\wamp\www\mysite\Index.php

This problem doesn't happen when I try other URLs, for example http://www.washingtontimes.com/.

Does anyone know why this happens?

Also, is it possible to extract the abstract from this webpage?

Solution

The website rejects requests that do not send a User-Agent header. This is the response it returns:

HTTP/1.1 403 Forbidden

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>

 <p>Sadly, your client does not supply a proper User-Agent,
 and is consequently excluded.</p>
 <p>We have an inordinate number of problems with automated scripts
 which do not supply a User-Agent, and violate the automated access
 guidelines posted at arxiv.org
 -- hence we now exclude them all.</p>
 <p>(In rare cases, we have found that accesses through proxy servers
 strip the User-Agent information. If this is the case, you need to contact
 the administrator of your proxy server to get it fixed.)</p>


<p>If you believe this determination to be in error, see
<b>http://arxiv.org/denied.html</b> for additional information.</p>
</body>
</html>
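
By default file_get_contents discards the body of an error response, so this explanation is easy to miss. The following is a minimal sketch of one way to see it, assuming the standard `ignore_errors` HTTP context option; the helper names `status_from_headers` and `fetch_with_status` are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: pulls the numeric status out of response header lines
// such as "HTTP/1.1 403 Forbidden". With redirects, the last status line wins.
function status_from_headers(array $headers): ?int {
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetches $url and returns [status, body] even for 4xx/5xx.
function fetch_with_status(string $url): array {
    $context = stream_context_create(array(
        'http' => array('ignore_errors' => true), // keep the body of error responses
    ));
    $fp = fopen($url, 'r', false, $context);
    if ($fp === false) {
        return array(null, null); // connection-level failure
    }
    $meta = stream_get_meta_data($fp);   // 'wrapper_data' holds the header lines
    $body = stream_get_contents($fp);
    fclose($fp);
    return array(status_from_headers($meta['wrapper_data']), $body);
}

// Example (performs a network request, so it is left commented out):
// list($status, $body) = fetch_with_status("http://arxiv.org/abs/1207.0102");
// echo $status, "\n", $body;
```

Printing the body this way should show the "Access Denied" page instead of only the 403 warning.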

If you send a user agent with your request, for example "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko", it will work:

$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);
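
As for the abstract, a DOM parser is usually more robust than a regular expression. The following is a minimal sketch, not a definitive implementation. It assumes the page exposes the title and abstract in `<meta name="citation_title">` and `<meta name="citation_abstract">` tags; check the actual page source to confirm this before relying on it. The function name `extract_meta` and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: returns the content attribute of <meta name="$name">,
// or null if no such tag exists.
function extract_meta(string $html, string $name): ?string {
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strcasecmp($meta->getAttribute('name'), $name) === 0) {
            // Collapse runs of whitespace, as the question's code does for <title>.
            return trim(preg_replace('/\s+/', ' ', $meta->getAttribute('content')));
        }
    }
    return null;
}

// Sample input standing in for the downloaded page (assumed structure).
$html = '<html><head><title>Example</title>'
      . '<meta name="citation_title" content="An Example Title">'
      . '<meta name="citation_abstract" content="First line' . "\n" . '  second line.">'
      . '</head><body></body></html>';

echo extract_meta($html, 'citation_title'), "\n";
echo extract_meta($html, 'citation_abstract'), "\n";
```

With the downloaded page in $str, calling extract_meta($str, 'citation_abstract') would return the abstract if the tag is present, and null otherwise.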