I am trying to modify links in a string containing HTML but am finding the modified URLs are missing parameters.
Example:
$html = '
<p>
<a href="http://example.com?foo=bar&bar=foobar">Example 1</a>
</p>';
libxml_use_internal_errors(true);
$dom = new \DOMDocument();
$dom->loadHTML($html);
$xpath = new \DOMXPath($dom);
foreach ($xpath->query('//a/@href') as $node) {
echo '$node->nodeValue: ' . $node->nodeValue . PHP_EOL;
$newValue = 'http://example2.com?foo=bar&bar=foobar';
echo '$newValue: ' . $newValue . PHP_EOL;
$node->nodeValue = $newValue;
echo '$node->nodeValue: ' . $node->nodeValue . PHP_EOL;
}
Output:
$node->nodeValue: http://example.com?foo=bar&bar=foobar
$newValue: http://example2.com?foo=bar&bar=foobar
$node->nodeValue: http://example2.com?foo=bar
As you can see, the second parameter is lost after updating the nodeValue.
While experimenting I tried changing $newValue to this:
$newValue = htmlentities('http://example2.com?foo=bar&bar=foobar');
And the output then becomes:
$node->nodeValue: http://example.com?foo=bar&bar=foobar
$newValue: http://example2.com?foo=bar&amp;bar=foobar
$node->nodeValue: http://example2.com?foo=bar&bar=foobar
Why is it necessary for the new node value to be run through htmlentities()?
Ampersands are reserved characters in XML/HTML: they begin character references. If you write one unescaped into a string in the DOM, what follows it (here, "&bar") may be treated as an entity reference, and when that reference can't be resolved the text can be truncated or dropped. When you run the value through htmlentities() first, it encodes the "&" as "&amp;", so the DOM reads it as a literal ampersand and everyone is speaking the same language again.
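To see the encoding step in isolation, here is a small sketch that uses only htmlentities() and html_entity_decode() with the URL from the question. It shows the string-level round trip only, not what any particular libxml version does with an unescaped ampersand.

```php
<?php
// The URL from the question, with a raw ampersand between the parameters.
$raw = 'http://example2.com?foo=bar&bar=foobar';

// htmlentities() turns the reserved "&" into the character reference "&amp;".
$encoded = htmlentities($raw);
echo $encoded . PHP_EOL; // http://example2.com?foo=bar&amp;bar=foobar

// Decoding the reference gives back the original string.
$decoded = html_entity_decode($encoded);
var_dump($decoded === $raw); // bool(true)
```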
Fortunately there's no need for htmlentities() at all. Instead of setting the nodeValue directly, use the setAttribute() method of the href's owner element.
// $node->nodeValue = $newValue;                       // remove this line
$node->ownerElement->setAttribute('href', $newValue);  // and use the setter instead
Directly manipulating strings in the DOM can lead to problems that won't necessarily manifest the same way across systems. I didn't just lose parameters with your example; I lost the entire URL.
I highly recommend sticking with setters whenever possible.
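Putting the suggestion together, a minimal end-to-end sketch might look like the following. It reuses the markup and URL from the question. The XPath is changed to select the a elements themselves (my assumption, so that setAttribute() can be called directly rather than through ownerElement), and the variable names $link and $result are only illustrative.

```php
<?php
$html = '
<p>
<a href="http://example.com?foo=bar&bar=foobar">Example 1</a>
</p>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$newValue = 'http://example2.com?foo=bar&bar=foobar';

// Select the elements rather than the attribute nodes, then use the setter.
foreach ($xpath->query('//a[@href]') as $link) {
    $link->setAttribute('href', $newValue);
}

// Read the value back through the getter: the raw string, ampersand included.
$result = $xpath->query('//a')->item(0)->getAttribute('href');
echo $result . PHP_EOL;
```

If the document is then serialized with saveHTML(), the ampersand should appear escaped as &amp; in the markup, which is the expected form inside an attribute value.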