html, regex, preg-match

Regex that makes sure a match starts with a string


I am running a regex on some HTML and need to extract the title attributes of some images.

The image title attributes look like this:

title="Image Title Here"

And this works for the task:

(?<=title=").*?(?=")

However, the problem is that it also grabs unwanted title attributes. I noticed, though, that in the HTML I run the regex on, the images are inside h3 tags.

How can I update my regex so that it only returns matches from HTML that starts with '<h3'?

My current regex is:

(?<=<h3).*(?<=title=").*?(?=")
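For reference, the behaviour of both patterns can be reproduced with a short PHP sketch. The sample markup here is hypothetical (it is not the asker's HTML) and is kept on one line so that `.` can run across the whole string. It suggests why the second pattern does not help: the lookbehind only fixes where the match starts, so everything after `<h3` ends up inside the match.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<h2><img title="Not this"></h2><h3><img title="This"></h3><a title="Nor this">x</a>';

// First pattern: the lookbehind anchors the match right after title=",
// and the lazy .*? stops at the next double quote.
preg_match_all('/(?<=title=").*?(?=")/', $html, $all);
print_r($all[0]); // every title on the page: "Not this", "This", "Nor this"

// Second pattern: the match starts right after "<h3", and the greedy .*
// runs up to the LAST title=" on the line, so the match contains markup.
preg_match_all('/(?<=<h3).*(?<=title=").*?(?=")/', $html, $scoped);
print_r($scoped[0]); // a single match: ><img title="This"></h3><a title="Nor this
```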

Solution

  • Using a DOMDocument with XPath should be less error prone:

    $html = <<<DATA
    <body>
    <h1>Text 1<img title="Not this"></h1>
    <h2>Text 2<img title="Not this"></h2>
    <h3>Text 3<img title="This"></h3>
    </body>
    DATA;
    
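    // Parse the fragment without adding implied <html>/<body> wrappers or a doctype
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    // Select <img> elements that are direct children of <h3> and have a title attribute
    $xpath = new DOMXPath($dom);
    $imgs = $xpath->query('//h3/img[@title]');
    $res = array();
    foreach ($imgs as $img) {
        $res[] = $img->getAttribute('title');
    }
    
    print_r($res);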
    


    The '//h3/img[@title]' XPath expression selects every img element that is a direct child of an h3 element and has a title attribute, and $img->getAttribute('title') returns the value of that attribute.
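If the images may be wrapped in another element inside the h3 (a link, for example), the child step `/` will miss them, while the descendant step `//` will not. The following is a minimal sketch of that difference. The helper name `imgTitles` and the sample markup are assumptions made for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: return the title attributes of all nodes matched by $query.
function imgTitles(string $html, string $query): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query($query) as $img) {
        $titles[] = $img->getAttribute('title');
    }
    return $titles;
}

// Hypothetical sample markup with one direct child, one nested image,
// and one image outside any h3.
$html = <<<DATA
<body>
<h3>Text<img title="Direct"></h3>
<h3><a href="#"><img title="Nested"></a></h3>
<p><img title="Outside"></p>
</body>
DATA;

print_r(imgTitles($html, '//h3/img[@title]'));  // direct children of h3 only
print_r(imgTitles($html, '//h3//img[@title]')); // any descendant of h3
```

With this markup, the first query should return only "Direct", and the second should return "Direct" and "Nested"; the image inside the p element is skipped by both.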