I wrote a little script that extracts information from a web site using PHP's DOMXPath class.
I query for <div class="sku" /> and execute a substring-before on the result. The result contains text, non-breaking spaces, a line break and more text. So what I'm trying to do is cut before the \r\n. It works fine when I use the following query:
$query = "substring-before(//div[@class='sku'],'\xC2\xA0\xC2\xA0\r\n')";
but fails as soon as I change the quotes (which shouldn't make any difference):
$query = 'substring-before(//div[@class="sku"],"\xC2\xA0\xC2\xA0\r\n")';
or
$query = 'substring-before(//div[@class=\'sku\'],\'\xC2\xA0\xC2\xA0\r\n\')';
How is this possible and how can I overcome this?
Live example here: http://codepad.viper-7.com/R1rCaj
The style of quotes makes a difference because when a string is enclosed in double-quotes PHP will interpret more escape sequences for special characters, including the ones you're using: the non-breaking space \xC2\xA0, carriage return \r, and newline \n.
When you have these enclosed in single-quotes, '\xC2\xA0\r\n', as in your second and third queries, PHP treats them as those literal characters: backslash, x, C, 2... etc.
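To make the difference visible, here is a small sketch that compares the two quoting styles byte by byte. The variable names ($double, $single, $needle) are just for illustration. It also shows one possible way around the problem if you want to keep single quotes for the XPath part: build the needle in a double-quoted string and concatenate it.

```php
<?php
// Double quotes: PHP decodes the escape sequences into raw bytes.
$double = "\xC2\xA0\r\n";
// Single quotes: the backslashes and letters are kept literally.
$single = '\xC2\xA0\r\n';

var_dump(strlen($double));  // int(4)
var_dump(strlen($single));  // int(12)
var_dump(bin2hex($double)); // string(8) "c2a00d0a"

// Keep single quotes for the XPath expression, double quotes for the needle.
$needle = "\xC2\xA0\xC2\xA0\r\n";
$query  = 'substring-before(//div[@class="sku"],"' . $needle . '")';
```

The resulting $query contains the real non-breaking spaces and line break, so it should behave like your first, working query.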
If your string already has what would be escape sequences in it as literal characters, and there's no way to get that corrected*, you're in the kinda dirty position of replacing them yourself.
This preg_replace_callback() will take care of the sort of sequences in your example, and it's trivial to extend to the rest of the escape sequences supported by double-quotes:
// Known good.
$query1 = "substring-before(//div[@class='sku'],'\xC2\xA0\xC2\xA0\r\n')";
// Known bad.
$query2 = 'substring-before(//div[@class=\'sku\'],\'\xC2\xA0\xC2\xA0\r\n\')';
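$query2 = preg_replace_callback(
    // A literal backslash followed by r, n, or x plus one or two hex digits.
    '/\\\\(?:[rn]|(?:x[0-9A-Fa-f]{1,2}))/',
    function ($matches) {
        switch (substr($matches[0], 0, 2)) {
            case '\r':
                return "\r";
            case '\n':
                return "\n";
            case '\x':
                // chr(hexdec()) also handles a single hex digit such as \xA,
                // which hex2bin() rejects because of the odd length.
                return chr(hexdec(substr($matches[0], 2)));
        }
    },
    $query2
);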
var_dump($query1 === $query2); // Now equal?
Output:
bool(true)
(*Really, you should get this fixed at the source.)
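If you do need the rest of the double-quote escapes, the same idea can be extended roughly like this. This is only a sketch: decode_escapes() is a made-up helper name, and it leaves out the \u{...} Unicode form and variable interpolation.

```php
<?php
// Hypothetical helper: decodes the common double-quote escape sequences
// that appear as literal characters in a string.
function decode_escapes(string $s): string
{
    $simple = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'v' => "\v",
        'e' => "\e", 'f' => "\f", '\\' => '\\', '$' => '$', '"' => '"',
    ];

    return preg_replace_callback(
        // Backslash, then a simple escape, a hex escape, or an octal escape.
        '/\\\\(?:([nrtvef\\\\$"])|x([0-9A-Fa-f]{1,2})|([0-7]{1,3}))/',
        function (array $m) use ($simple): string {
            if ($m[1] !== '') {
                return $simple[$m[1]];
            }
            if ($m[2] !== '') {
                return chr(hexdec($m[2]));
            }
            // Octal values above \377 wrap around to a single byte.
            return chr(octdec($m[3]) & 0xFF);
        },
        $s
    );
}

var_dump(decode_escapes('\xC2\xA0\r\n') === "\xC2\xA0\r\n"); // bool(true)
```

Anything the pattern doesn't recognise (for example \q) is left untouched, which is also what a double-quoted string does.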