I am running a regex on some HTML and need to extract the title attributes of some images.
The title attributes look like this:
title="Image Title Here"
And this works for the task:
(?<=title=").*?(?=")
However, the problem is that it also grabs unwanted title attributes. I noticed, though, that in the HTML I run the regex on, the images are inside h3 tags.
How can I update my regex so that it only matches title attributes inside h3 tags?
My current regex is:
(?<=<h3).*(?<=title=").*?(?=")
Using a DOMDocument with XPath should be less error-prone:
$html = <<<DATA
<body>
<h1>Text 1<img title="Not this"></h1>
<h2>Text 2<img title="Not this"></h2>
<h3>Text 3<img title="This"></h3>
</body>
DATA;
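// Parse the fragment without adding implied <html>/<body> wrappers or a doctype
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Select img elements that are direct children of an h3 and have a title attribute
$imgs = $xpath->query('//h3/img[@title]');
$res = array();
foreach ($imgs as $img) {
    $res[] = $img->getAttribute('title');
}
print_r($res);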
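The '//h3/img[@title]' XPath expression selects every img element that is a direct child of an h3 element and has a title attribute, and $img->getAttribute('title') returns the value of that attribute. If the images may be nested deeper inside the h3 (for example, wrapped in a link), use '//h3//img[@title]' instead.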
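As a rough sketch of how this could be packaged for reuse, the snippet below wraps the same DOMDocument/XPath approach in a function. The name extractH3ImageTitles and the sample markup are hypothetical, not part of the original answer. It assumes the images may sit at any depth below the h3, so it uses the descendant axis (//h3//img), and it selects the attribute nodes directly with /@title. It also silences libxml parse warnings, on the assumption that real-world HTML is often malformed.

```php
<?php
// Hypothetical helper (not from the original answer): returns the title
// attribute of every img found anywhere inside an h3 element.
function extractH3ImageTitles(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them,
    // since scraped HTML is frequently not well-formed.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Discard collected warnings and restore the previous error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    $titles = [];
    // "//h3//img" matches img elements at any depth below an h3;
    // "/@title" selects the title attribute node itself.
    foreach ($xpath->query('//h3//img/@title') as $attr) {
        $titles[] = $attr->nodeValue;
    }
    return $titles;
}

// Sample input (assumed for illustration only).
$html = <<<DATA
<body>
<h2>Text 2<img title="Not this"></h2>
<h3>Text 3<img title="This"></h3>
<h3><a href="#"><img title="Nested"></a></h3>
</body>
DATA;

print_r(extractH3ImageTitles($html));
```

Selecting //h3//img/@title skips the separate getAttribute() call; if only direct children of the h3 should match, '//h3/img/@title' would be the stricter variant.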