Add domain to <img> src attribute value if a relative path


I have a text variable that contains multiple images with relative or absolute paths. If a src attribute starts with http or https, it should be left alone; if it starts with / or with something like abc/, a base URL should be prepended.

I tried the following:

<?php
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;

$base = 'https://example.com/';

$pattern = "/<img src=\"[^http|https]([^\"]*)\"/";
$replace = "<img src=\"" . $base . "\${1}\"";
echo $text = preg_replace($pattern, $replace, $html);

My output is:

<img src="https://example.com/ocs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />

The problem: the result is almost correct, but when the src attribute starts with something like docs/, its first letter is cut off (see the first img src in the output above).

The output I need is:

<img src="https://example.com/docs/relative/url/img.jpg" /><!--check this and compare with current result, you will get the difference -->
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />

Could anyone help me fix this?


Solution

  • The following pattern targets src attributes whose value does not start with http (which covers https as well). For relative paths that begin with a forward slash, the leading slash is consumed by the match, so it is replaced when the $base string is injected at the start of the src value.

    Code:

    $base = 'https://example.com/';
    echo preg_replace('~ src="(?!http)\K/?~', $base, $html);
    

    Output:

    <img src="https://example.com/docs/relative/url/img.jpg" />
    <img src="https://example.com/docs/relative/url/img.jpg" />
    <img src="https://docs/relative/url/img.jpg" />
    <img src="http://docs/relative/url/img.jpg" />
    

    Breakdown:

    ~           #starting pattern delimiter
     src="      #match space, s, r, c, =, then "
    (?!http)    #only continue matching if the next characters are not http (this also rules out https)
    \K          #forget any previously matched characters so they are not destroyed by the replacement string
    /?          #optionally match a forward slash
    ~           #ending pattern delimiter
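
One caveat worth noting: (?!http) also skips relative paths that merely begin with the letters http, such as httpdocs/img.jpg. A slightly stricter variant, sketched below, requires the scheme separator as well. The sample $html and the decision to leave protocol-relative URLs (//host/...) untouched are assumptions for illustration, not part of the original question.

```php
<?php
$base = 'https://example.com/';
$html = <<<HTML
<img src="docs/img.jpg" />
<img src="/docs/img.jpg" />
<img src="httpdocs/img.jpg" />
<img src="https://cdn.example.org/img.jpg" />
<img src="//cdn.example.org/img.jpg" />
HTML;

// (?!https?://|//) : skip only absolute http(s) URLs and protocol-relative URLs
// \K               : keep ` src="` out of the replaced portion
// /?               : consume one optional leading slash so the slash in $base is not doubled
$result = preg_replace('~ src="(?!https?://|//)\K/?~', $base, $html);
echo $result;
```

With this variant the first three images receive the base URL, while the last two are left as they are.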
    

    As for your pattern, /<img src=\"[^http|https]([^\"]*)\"/:

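    1. [^http|https] actually means "match a single character that is not from this list: |, h, t, p, and s". It could be simplified to [^|hpst] because the order of the listed characters in a negated character class is irrelevant and duplicating characters is meaningless. So you see, [^...] is not how you say "a string starts with something or somethingelse". This is also why the d of docs/ disappears: that single character is matched by the class, outside the capture group, so it is never carried into the replacement. With /docs/ the consumed character is the slash, which the trailing slash of $base happens to supply again.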
    2. Capturing all remaining characters in a substring until the next double quote with the intent to use it again in the replacement is unnecessary. This is why I use \K to pinpoint where $base should be injected instead of ([^\"]*).
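
To make both points concrete, here is a small sketch. The first two calls show the negated character class consuming (or rejecting) exactly one character; the string static/img.jpg is a made-up example. The second half is one possible repair of the original pattern that keeps the capture group, using a negative lookahead in place of the character class. It is an illustration only, not the approach recommended above.

```php
<?php
// A negated character class matches exactly ONE character that is not in the list.
$first  = preg_match('/^[^http|https]/', 'docs/img.jpg');   // 1: "d" is not in the list, so it is consumed
$second = preg_match('/^[^http|https]/', 'static/img.jpg'); // 0: "s" is in the list, so this relative path is skipped
var_dump($first, $second);

// The original pattern with the character class swapped for a negative lookahead
// and the optional leading slash matched outside the capture group.
$base    = 'https://example.com/';
$html    = '<img src="docs/relative/url/img.jpg" />';
$pattern = '~<img src="(?!https?://)/?([^"]*)"~';
$replace = '<img src="' . $base . '$1"';
$fixed   = preg_replace($pattern, $replace, $html);
echo $fixed;
```

Note that the second preg_match() call returns 0, which means the original pattern would also have silently ignored any relative path beginning with h, t, p, or s.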

    Furthermore, I always recommend the stability of a DOM parser when dealing with a valid HTML document. You can use DOMDocument with XPath to target the qualifying elements and modify the src attributes without regex.

    Code:

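    $dom = new DOMDocument();
    // These flags ask libxml not to add <html>/<body> wrappers or a doctype
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    // Select only <img> elements whose src does not begin with "http"
    foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
        // Trim any leading slash so it is not doubled after $base
        $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
    }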
    echo $dom->saveHTML();
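
The loop above splits its logic between the XPath predicate (the prefix check) and PHP (the slash trimming). As a rough sketch, the same per-attribute decision can be pulled into a standalone function, which is easier to test in isolation. The function name absolutize_src is hypothetical, and treating any URL with a scheme (or a protocol-relative //host/... URL) as already absolute is an assumption that goes slightly beyond the original http check.

```php
<?php
/**
 * Hypothetical helper: prepend $base to $src unless $src is already absolute.
 */
function absolutize_src(string $src, string $base): string
{
    // Leave URLs that carry a scheme (http:, https:, data:, ...) or are protocol-relative (//host/...).
    if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//)~i', $src)) {
        return $src;
    }
    // Normalise the join so exactly one slash separates base and path.
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

$base = 'https://example.com/';
foreach (['docs/img.jpg', '/docs/img.jpg', 'https://docs/img.jpg', 'http://docs/img.jpg'] as $src) {
    echo absolutize_src($src, $base), PHP_EOL;
}
```

Inside the DOM loop, the helper would be called as $node->setAttribute('src', absolutize_src($node->getAttribute('src'), $base)), in which case the XPath query could simply select //img[@src].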
    

    A related answer: https://stackoverflow.com/a/48837947/2943403