php regex curl utf-8 domxpath

PHP regex not matching utf-8 decoded string


I am having trouble with a regex. I'm not sure why it fails, but I think it may have something to do with character encoding.

I am using cURL to fetch the page content from a website. I then run a DOMXPath query to get a certain element, read that element's content, and run a regex against that content. However, the regex does not match and I don't know why.

This is what I receive from the element:

X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj
  a B 7dd. 

I try to match it with this pattern:

/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/

I have tested this in Dreamweaver and it matches there, so I have no idea why it wouldn't match online.

Also, the page I am receiving is encoded as UTF-8.

I attempt to convert the content to remove the UTF-8 characters using:

iconv('utf-8', 'ISO-8859-1//IGNORE', $td->item(0)->nodeValue);

If I don't remove the UTF-8 characters, there are weird Á symbols after the 'a', 'b' and 'c' variable values.


Solution

  • OK, I figured it out. All I had to do to get rid of these invisible invalid characters was:

    $value = preg_replace("/[^a-zA-Z0-9 %():\$.\/-]/",' ',$value);
    

    Pretty much just replace any character that wasn't valid with a space or an empty string. In my case I used a space because it appeared that some of the spaces themselves were invalid.
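
A narrower alternative, sketched under an assumption the post does not confirm: the "invalid spaces" may be non-breaking spaces (U+00A0, typically from `&nbsp;` in the page). A literal space in the pattern does not match that character, and converting to ISO-8859-1 would only turn it into the single byte 0xA0, which is still not an ASCII space. If that is the cause, it may be enough to keep the string as UTF-8, skip `iconv`, and normalize only the space characters. The helper name `normalize_spaces` and the sample input are hypothetical; the pattern is the one from the question with the `u` modifier added.

```php
<?php
// Hypothetical helper (not from the original post): replace every Unicode
// space separator, including U+00A0 NO-BREAK SPACE, with an ASCII space.
function normalize_spaces(string $text): string
{
    // \p{Zs} is the Unicode "space separator" category; the u modifier
    // makes PCRE treat the pattern and the subject as UTF-8.
    $clean = preg_replace('/\p{Zs}/u', ' ', $text);

    // preg_replace() returns null on error (for example invalid UTF-8),
    // in which case the input is returned unchanged.
    return $clean ?? $text;
}

// Sample input modeled on the question. The non-breaking spaces are an
// assumption about what the scraped page actually contained.
$nbsp = "\u{00A0}";
$raw  = "X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj\n{$nbsp}a{$nbsp}B{$nbsp}7dd.";

// Pattern from the question, with the u modifier added.
$pattern = '/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/u';

// The literal spaces in the pattern do not match U+00A0, so this fails.
$before = preg_match($pattern, $raw);

// After normalization the same pattern matches.
$after = preg_match($pattern, normalize_spaces($raw), $m);

var_dump($before, $after, $m['x'], $m['y'], $m['z'], $m['a'], $m['b'], $m['c']);
```

Unlike the whitelist replacement above, this leaves other non-ASCII characters (accented letters, for instance) intact. To check whether non-breaking spaces really are the culprit, `bin2hex($value)` on the raw node value should show the byte pair `c2a0` wherever one occurs.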