Tags: php, regex, html-parsing, src, text-extraction

Matching the src attribute of an HTML <img> tag with a regular expression


I've got a bunch of strings that were already extracted from an HTML file. Examples:

<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys&lt;p&gt;&lt;span class='points-q7Vdm'&gt;18,736&lt;/span&gt;&nbsp;&lt;span class='points-text-q7Vdm'&gt;points&lt;/span&gt;  : 316,091 views&lt;/p&gt;">

<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">

<img src="//s.imgur.com/images/blog_rss.png">

I am trying to write a regular expression that will grab the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the most trouble. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.

Thanks in advance


Solution

  • Use DOM or another parser for this; don't try to parse HTML with regular expressions.

    Example:

    $html = <<<DATA
    <img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys&lt;p&gt;&lt;span class='points-q7Vdm'&gt;18,736&lt;/span&gt;&nbsp;&lt;span class='points-text-q7Vdm'&gt;points&lt;/span&gt;  : 316,091 views&lt;/p&gt;">
    <img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
    <img src="//s.imgur.com/images/blog_rss.png">
    DATA;
    
    $doc = new DOMDocument();
    $doc->loadHTML($html); // parse the markup into a DOM tree
    
    $xpath = new DOMXPath($doc);
    $imgs  = $xpath->query('//img'); // select every <img> element
    
    foreach ($imgs as $img) {
        echo $img->getAttribute('src') . "\n"; // print each src on its own line
    }
    

    Output

    //i.imgur.com/tApg8ebb.jpg
    //i.imgur.com/SwmwL4Gb.jpg
    //s.imgur.com/images/blog_rss.png
    

    If you would rather store the results in an array, you could do:

    $sources = []; // initialise so print_r() still works when no <img> is found
    foreach ($imgs as $img) {
        $sources[] = $img->getAttribute('src');
    }
    
    print_r($sources);
    

    Output

    Array
    (
        [0] => //i.imgur.com/tApg8ebb.jpg
        [1] => //i.imgur.com/SwmwL4Gb.jpg
        [2] => //s.imgur.com/images/blog_rss.png
    )
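
Since the question mentions replacing the src later, the same DOM approach can probably be extended with setAttribute(). The following is only a rough sketch: the function name rewriteImgSources and the rule "prefix protocol-relative URLs with https:" are assumptions made for illustration, not something taken from the question, so swap in your own conditions.

```php
<?php
// Hypothetical helper: the name and the rewrite rule below are illustrative
// assumptions, not part of the original question or answer.
function rewriteImgSources(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags  = [];

    // Only visit <img> elements that actually carry a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $src = $img->getAttribute('src');

        // Example condition (assumed): make protocol-relative URLs explicit.
        if (strpos($src, '//') === 0) {
            $img->setAttribute('src', 'https:' . $src);
        }

        // Serialise just this element back to an HTML string.
        $tags[] = $doc->saveHTML($img);
    }

    return $tags;
}

$html = '<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">'
      . '<img src="http://example.com/a.png">';

print_r(rewriteImgSources($html));
```

Passing the element to saveHTML() returns only that tag, which avoids the html and body wrappers that loadHTML() adds around a fragment.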