I am having trouble with a regex. I'm not sure why it is failing, but I think it may have something to do with character encoding.
I am using cURL to fetch the page content from a website, then a DOMXPath query to get a certain element. From that element I take its content, and on that content I run a regex. However, the regex does not match and I don't know why.
This is what I receive from the element:
X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj
a B 7dd.
I try to match it with this pattern, and it does not match:
/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/
I have tested this in Dreamweaver and it matches there, so I have no idea why it wouldn't match online.
Also, the page I am receiving is encoded as UTF-8. I attempt to convert the content to remove the UTF-8 characters by using
iconv('utf-8', 'ISO-8859-1//IGNORE', $td->item(0)->nodeValue);
If I don't remove the UTF-8 characters, weird Á symbols appear after the values of the 'a', 'b' and 'c' groups.
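One way to check the encoding suspicion is to dump the raw bytes of the string before and after the conversion. The sketch below is only an illustration: the sample value and the non-breaking space (U+00A0) in it are assumptions, since the post does not show which invisible character was actually present.

```php
<?php
// Hypothetical sample: "a" and "B" separated by a non-breaking space,
// which is the two bytes C2 A0 in UTF-8 (an assumption, not the real page data).
$value = "a\xC2\xA0B";

// Hex dump of the UTF-8 bytes: 61 = "a", c2a0 = U+00A0, 42 = "B".
$utf8Hex = bin2hex($value);

// After conversion the non-breaking space survives as the single byte A0.
$converted = iconv('UTF-8', 'ISO-8859-1//IGNORE', $value);
$latinHex = bin2hex($converted);

// A literal space in a pattern is byte 20, so it matches neither form.
$matchesUtf8  = preg_match('/a B/', $value);
$matchesLatin = preg_match('/a B/', $converted);

echo $utf8Hex, "\n", $latinHex, "\n";
var_dump($matchesUtf8, $matchesLatin);
```

If the dump of the real value shows bytes other than 20 where a space is expected, the pattern's literal spaces cannot match them, whatever the page looks like in a browser.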
OK, I figured it out. All I had to do to get rid of these invisible invalid characters was:
$value = preg_replace("/[^a-zA-Z0-9 %():\$.\/-]/",' ',$value);
Pretty much, just replace any character that isn't valid with a space or an empty string. In my case I used a space, because it appeared that some of the spaces themselves were invalid.
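Put together, the cleanup and the match might look like the sketch below. The sample string is hypothetical: it assumes the separators were UTF-8 non-breaking spaces and that the iconv() step was skipped, which is why the `u` modifier is added so that each multi-byte character becomes a single space.

```php
<?php
// Hypothetical input: newline after the Z value, non-breaking spaces
// (UTF-8 bytes C2 A0) where ordinary spaces were expected.
$value = "X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj\na\xC2\xA0B\xC2\xA07dd.";

$pattern = '/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*'
         . '(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/';

// Before cleanup the literal spaces in the pattern find no byte 20 to match.
$before = preg_match($pattern, $value);

// Replace every character outside the whitelist with a plain space.
// The "u" modifier treats the subject as UTF-8, so one character gives one space.
$clean = preg_replace('/[^a-zA-Z0-9 %():$.\/-]/u', ' ', $value);

$after = preg_match($pattern, $clean, $m);

var_dump($before, $after);
echo $m['x'], ' | ', $m['y'], ' | ', $m['a'], ' ', $m['b'], ' ', $m['c'], "\n";
```

Without the `u` modifier the replacement works byte by byte, so a two-byte character would turn into two spaces and the single spaces in the pattern would still fail; converting with iconv() first, as above, avoids that because the character is then a single byte.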