I have a web page, for example http://example.com/some-page. If I pass this URL to my PHP function, it should grab the title and content of the page. I've tried to grab the title like this:
function page_title($url) {
    $page = @file_get_contents($url);
    if (preg_match('~<h1 class="page-title">(.*)<\/h1>~is', $page, $matches)) {
        return $matches[0];
    }
}
echo page_title('http://example.com/some-page');
What is my mistake?
Your function almost works. I would propose the DOM parser solution (see below), but before doing that I will point out a few weaknesses in the regular expression and the code:
The (.*) capture group is greedy, i.e. it will catch a string that is as long as possible before a closing </h1>, even across line breaks (because of the s modifier). So if your document has multiple h1 tags, it would capture everything up to the last closing tag! You could fix this by making it a lazy capture: (.*?)
The actual page may have other tags, like a span, inside the title. You might want to improve the regular expression to exclude any tags that surround your title, but PHP has a function, strip_tags, for that purpose.
With the @ prefix, errors raised by file_get_contents are suppressed, so you may miss them. I would suggest removing the @. You could also check the return value for false.
Are you sure you want the H1 tag contents? A page often has a specific title tag.
The above improvements will give you this code:
function page_title($url) {
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return null;
    }
    if (preg_match('~<h1 class="page-title">(.*?)<\/h1>~is', $page, $matches)) {
        // $matches[1] is the capture group: the text between the tags
        return strip_tags($matches[1]);
    }
    return null;
}
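To make the greedy versus lazy difference concrete, here is a small sketch that runs both patterns against an inline HTML string instead of a fetched page. The sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup: two h1 elements, the first with a nested span.
$html = '<h1 class="page-title"><span>First</span></h1><p>text</p><h1>Second</h1>';

// Greedy: (.*) runs on to the LAST closing </h1> in the document.
preg_match('~<h1 class="page-title">(.*)</h1>~is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST closing </h1>.
preg_match('~<h1 class="page-title">(.*?)</h1>~is', $html, $lazy);

// strip_tags() removes whatever markup ended up inside the captured group.
$greedyTitle = strip_tags($greedy[1]);
$lazyTitle = strip_tags($lazy[1]);

echo $greedyTitle, "\n"; // FirsttextSecond
echo $lazyTitle, "\n";   // First
```

The greedy result swallows the paragraph and the second heading, which is exactly the failure described above.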
Although this works, you will sooner or later bump into a document that has an extra space in the h1 tag, or has another attribute before class, or has more than one CSS class, etc., making the match fail. The following regular expression will deal with some of these problems:
~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is
... but the class attribute still has to come before any other attributes, and its value must be enclosed in double quotes. That could be solved too, but the regular expression would become a monster.
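A quick way to see what the extended pattern does and does not handle is to run it over a few hand-written variants. The sample strings below are invented for illustration; note that with this pattern the title sits in the third capture group.

```php
<?php
$pattern = '~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is';

// Hypothetical h1 variants (not taken from any real page).
$samples = [
    'extra class and attribute' => '<h1 class="big page-title" id="t">Hello</h1 >',
    'spaces around equals sign' => '<h1 class = "page-title extra">Hi</h1>',
    'class is not first'        => '<h1 id="t" class="page-title">Missed</h1>',
];

$results = [];
foreach ($samples as $label => $html) {
    // Group 3 holds the title text; null marks "no match".
    $results[$label] = preg_match($pattern, $html, $m) ? $m[3] : null;
}

var_dump($results);
```

The first two variants match, while the third one is missed because another attribute precedes class.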
The DOM way
Regular expressions are not the ideal way to extract content from HTML. Here is an alternative function based on DOM parsing:
function xpage_title($url) {
    // Create a new DOM document to hold the web page structure
    $xml = new DOMDocument();
    // Load the URL's contents into the DOM, suppressing parser warnings
    libxml_use_internal_errors(true);
    $success = $xml->loadHTMLFile($url);
    libxml_use_internal_errors(false);
    if (!$success) {
        echo "Failed to open $url.";
        return null;
    }
    // Find the first h1 with class 'page-title' and return its text content
    foreach ($xml->getElementsByTagName('h1') as $h1) {
        // Does it have the desired class?
        if (in_array('page-title', explode(" ", $h1->getAttribute('class')))) {
            return $h1->textContent;
        }
    }
    return null;
}
The above could be improved further by making use of DOMXPath.
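As a rough sketch of what a DOMXPath variant might look like: the function name xpath_page_title is hypothetical, and it takes an HTML string rather than a URL so that it can be tried without network access. The XPath expression pads the class attribute with spaces so that page-title only matches as a whole class name.

```php
<?php
// Hypothetical helper: returns the text of the first h1 whose class list
// contains "page-title", or null if there is none.
function xpath_page_title(string $html): ?string {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $xpath = new DOMXPath($doc);
    // Match "page-title" as a whole word inside the class attribute.
    $nodes = $xpath->query(
        '//h1[contains(concat(" ", normalize-space(@class), " "), " page-title ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up sample page: class is not the first attribute and has two values.
$sample = '<html><body><h1 id="x" class="big page-title"><span>Hello</span> world</h1>'
        . '<h1 class="page-title">Second</h1></body></html>';
echo xpath_page_title($sample), "\n"; // Hello world
```

For a real page, the string could come from file_get_contents($url), with the same check for false as before.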
EDIT
You mentioned in the comments that you actually don't want the contents of the H1 tag, because it contains more text than you want. In that case you could read the contents of the title tag and the article tag:
function page_title_and_content($url) {
    // On PHP 5.4+: $result = (object) ["title" => null, "content" => null];
    $result = new stdClass();
    $result->title = null;
    $result->content = null;
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return $result; // both properties stay null
    }
    if (preg_match('~<title>(.*?)</title>~is', $page, $matches)) {
        $result->title = $matches[1];
    }
    // Greedy match: captures everything up to the last </article>
    if (preg_match('~<article>(.*)</article>~is', $page, $matches)) {
        $result->content = $matches[1];
    }
    return $result;
}
$result = page_title_and_content('http://www.example.com/example');
echo "title: " . $result->title . "<br>";
echo "content: <br>" . $result->content . "<br>";
The above code will return an object with two properties: title and content. Note that the content property will contain HTML tags, potentially including images and such. If you don't want tags, then apply strip_tags.
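A minimal sketch of that last step, assuming you also want to tidy up the whitespace that the removed tags leave behind. The $content string below is a made-up stand-in for $result->content.

```php
<?php
// Hypothetical article markup, standing in for $result->content.
$content = "<p>Hello <img src=\"a.png\"> <b>world</b></p>\n<p>Bye</p>";

// Remove all tags, then collapse runs of whitespace into single spaces.
$text = trim(preg_replace('~\s+~', ' ', strip_tags($content)));

echo $text, "\n"; // Hello world Bye
```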