Replace all relative URLs with absolute URLs


I've seen a few answers (like this one), but I have some more complex scenarios I'm not sure how to account for.

I essentially have full HTML documents. I need to replace every single relative URL with absolute URLs.

Elements in the HTML may look like the following (there may be other cases as well):

<img src="/relative/url/img.jpg" />
<form action="/">
<form action="/contact-us/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" />

Desired Output would be:

// "//example.com/" is ideal, but "http(s)://example.com/" are acceptable

<img src="//example.com/relative/url/img.jpg" />
<form action="//example.com/">
<form action="//example.com/contact-us/">
<a href='//example.com/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" /> <!-- Unmodified -->

I DON'T want to replace protocol-relative URLs, since they already function as absolute URLs. I've come up with some code that works, but I'm wondering if I can clean it up a little, as it's extremely repetitive.

The complication is that I have to account for single- and double-quoted attribute values for src, href, and action (am I missing any attributes that can have relative URLs?) while simultaneously avoiding protocol-relative URLs.

Here's what I have so far:

// Make URL replacement protocol relative to not break insecure/secure links
$url = str_replace( array( 'http://', 'https://' ), '//', $url );

// Temporarily Modify Protocol-Relative URLS
$str = str_replace( 'src="//', 'src="::TEMP_REPLACE::', $str );
$str = str_replace( "src='//", "src='::TEMP_REPLACE::", $str );
$str = str_replace( 'href="//', 'href="::TEMP_REPLACE::', $str );
$str = str_replace( "href='//", "href='::TEMP_REPLACE::", $str );
$str = str_replace( 'action="//', 'action="::TEMP_REPLACE::', $str );
$str = str_replace( "action='//", "action='::TEMP_REPLACE::", $str );

// Replace all other Relative URLS
$str = str_replace( 'src="/', 'src="'. $url .'/', $str );
$str = str_replace( "src='/", "src='". $url ."/", $str );
$str = str_replace( 'href="/', 'href="'. $url .'/', $str );
$str = str_replace( "href='/", "href='". $url ."/", $str );
$str = str_replace( 'action="/', 'action="'. $url .'/', $str );
$str = str_replace( "action='/", "action='". $url ."/", $str );

// Change Protocol Relative URLs back
$str = str_replace( 'src="::TEMP_REPLACE::', 'src="//', $str );
$str = str_replace( "src='::TEMP_REPLACE::", "src='//", $str );
$str = str_replace( 'href="::TEMP_REPLACE::', 'href="//', $str );
$str = str_replace( "href='::TEMP_REPLACE::", "href='//", $str );
$str = str_replace( 'action="::TEMP_REPLACE::', 'action="//', $str );
$str = str_replace( "action='::TEMP_REPLACE::", "action='//", $str );

I mean, it works, but it's uuugly, and I was thinking there's probably a better way to do it.


Solution

  • New Answer

    If your real HTML document is valid (and has a parent/containing tag), then the most appropriate and reliable technique will be to use a proper DOM parser.

    Here is how DOMDocument and XPath can be used to elegantly target and replace your designated tag attributes:

    Code1 - Nested XPath Queries: (Demo)

    $domain = '//example.com';
    $tagsAndAttributes = [
        'img' => 'src',
        'form' => 'action',
        'a' => 'href'
    ];
    
    $dom = new DOMDocument; 
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    foreach ($tagsAndAttributes as $tag => $attr) {
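        // Match only root-relative values: a leading "/" that is not the start of "//".
        // Without the first test, absolute URLs such as "http://..." would also be prefixed.
        foreach ($xpath->query("//{$tag}[starts-with(@{$attr}, '/') and not(starts-with(@{$attr}, '//'))]") as $node) {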
            $node->setAttribute($attr, $domain . $node->getAttribute($attr));
        }
    }
    echo $dom->saveHTML();
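
    The requirement for a parent/containing tag matters because of LIBXML_HTML_NOIMPLIED. For a fragment with several top-level elements, one possible workaround is to wrap it in a temporary element and serialize only that element's children afterwards. The following is a minimal sketch under that assumption; the function name absolutizeFragment and the temporary <div> wrapper are illustrative choices, not part of the original answer, and the predicate only targets values that start with a single slash.

    ```php
    // Hypothetical helper (name and wrapper element are assumptions for illustration).
    function absolutizeFragment(string $fragment, string $domain): string
    {
        $dom = new DOMDocument();
        // Collect parser warnings internally instead of emitting them (e.g. for HTML5 tags).
        $previous = libxml_use_internal_errors(true);
        // The temporary <div> provides the single containing element.
        $dom->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        $tagsAndAttributes = ['img' => 'src', 'form' => 'action', 'a' => 'href'];
        foreach ($tagsAndAttributes as $tag => $attr) {
            // Root-relative only: starts with "/" but not with "//".
            $query = "//{$tag}[starts-with(@{$attr}, '/') and not(starts-with(@{$attr}, '//'))]";
            foreach ($xpath->query($query) as $node) {
                $node->setAttribute($attr, $domain . $node->getAttribute($attr));
            }
        }

        // Serialize the wrapper's children so the temporary <div> is not part of the result.
        $out = '';
        foreach ($dom->documentElement->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
        return $out;
    }

    echo absolutizeFragment('<img src="/a.jpg"><a href="//cdn.example.com/b/">b</a>', '//example.com');
    ```

    Note that the serializer normalizes the markup (for example, attribute values come back double-quoted), so the output is equivalent HTML rather than a byte-for-byte copy of the input.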
    

    Code2 - Single XPath Query w/ Condition Block: (Demo)

    $domain = '//example.com';
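    // Each predicate matches only root-relative values: a leading "/" that is not the start of "//".
    $targets = [
        "//img[starts-with(@src, '/') and not(starts-with(@src, '//'))]",
        "//form[starts-with(@action, '/') and not(starts-with(@action, '//'))]",
        "//a[starts-with(@href, '/') and not(starts-with(@href, '//'))]"
    ];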
    
    $dom = new DOMDocument; 
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query(implode('|', $targets)) as $node) {
        if ($src = $node->getAttribute('src')) {
            $node->setAttribute('src', $domain . $src);
        } elseif ($action = $node->getAttribute('action')) {
            $node->setAttribute('action', $domain . $action);
        } else {
            $node->setAttribute('href', $domain . $node->getAttribute('href'));
        }
    }
    echo $dom->saveHTML();
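
    As a variation on Code2, the condition block can be avoided by selecting the attribute nodes themselves rather than their elements. This is only a sketch: the function name prefixRootRelativeUrls is hypothetical, and the root-relative check is done in PHP instead of in the XPath predicate to keep the query short.

    ```php
    // Hypothetical helper (name is an assumption for illustration).
    function prefixRootRelativeUrls(string $html, string $domain): string
    {
        $dom = new DOMDocument();
        // Collect parser warnings internally instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        // The union selects DOMAttr nodes, so each result already knows its own name.
        foreach ($xpath->query('//img/@src | //form/@action | //a/@href') as $attr) {
            $value = $attr->value;
            // Root-relative only: starts with "/" but not with "//".
            if (strncmp($value, '/', 1) === 0 && strncmp($value, '//', 2) !== 0) {
                $attr->ownerElement->setAttribute($attr->name, $domain . $value);
            }
        }
        return $dom->saveHTML();
    }

    echo prefixRootRelativeUrls(
        '<div><img src="/a.jpg"><a href="https://other.example/page">x</a><form action="/"></form></div>',
        '//example.com'
    );
    ```

    Adding another tag/attribute pair then only means extending the XPath union, with no extra branch in the loop.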
    

    Old Answer: (...regex is not "DOM-aware" and is vulnerable to unexpected breakage)

    If I understand you properly, you have a base value in mind, and you only want to apply it to relative paths.

    Code: (Demo)

    $html=<<<HTML
    <img src="/relative/url/img.jpg" />
    <form action="/">
    <a href='/relative/url/'>Note the Single Quote</a>
    <img src="//site.com/protocol-relative-img.jpg" />
    HTML;
    
    $base='https://example.com';
    
    echo preg_replace('~(?:src|action|href)=[\'"]\K/(?!/)[^\'"]*~',"$base$0",$html);
    

    Output:

    <img src="https://example.com/relative/url/img.jpg" />
    <form action="https://example.com/">
    <a href='https://example.com/relative/url/'>Note the Single Quote</a>
    <img src="//site.com/protocol-relative-img.jpg" />
    

    Pattern Breakdown:

    ~                      #Pattern delimiter
    (?:src|action|href)    #Match: src or action or href
    =                      #Match equal sign
    [\'"]                  #Match single or double quote
    \K                     #Restart fullstring match (discard previously matched characters)
    /                      #Match slash
    (?!/)                  #Negative lookahead (zero-length assertion): must not be a slash immediately after first matched slash
    [^\'"]*                #Match zero or more non-single/double quote characters
    ~                      #Pattern delimiter
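
    To see which inputs the pattern touches, here is a small sketch that wraps it in a function and runs it over a few sample tags. The function name absolutizeWithRegex and the sample strings are made up for illustration. It uses preg_replace_callback instead of the "$base$0" replacement string, which is an assumption-free way to avoid the base value being interpreted as a replacement reference.

    ```php
    // Hypothetical wrapper around the pattern from the breakdown above.
    function absolutizeWithRegex(string $html, string $base): string
    {
        $pattern = '~(?:src|action|href)=[\'"]\K/(?!/)[^\'"]*~';
        // $m[0] is the root-relative path only, because \K discards the attribute name and quote.
        $result = preg_replace_callback(
            $pattern,
            static function (array $m) use ($base): string {
                return $base . $m[0];
            },
            $html
        );
        // preg_replace_callback returns null on a regex error; fall back to the input.
        return $result ?? $html;
    }

    $samples = [
        '<a href="/about/">',                    // root-relative: rewritten
        "<a href='//cdn.example.com/app.js'>",   // protocol-relative: untouched
        '<a href="https://other.example/page">', // absolute: untouched
        '<a href="#top">',                       // fragment only: untouched
        '<img data-src="/x.jpg">',               // also rewritten, since "src=" matches inside "data-src="
    ];
    foreach ($samples as $tag) {
        echo absolutizeWithRegex($tag, 'https://example.com'), PHP_EOL;
    }
    ```

    The last sample illustrates why the regex is described as not "DOM-aware": it matches on text alone, so any attribute whose name ends in src, action, or href is affected.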