preg_match and file_get_contents and æ ø å

I have a question about preg_match. Suppose I try to fetch something like this: Århus er en by i Danmark means Århus is a city in Denmark

preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#si", $webside, $a2 );

echo $a2[1];

Then the output will be:

�rhus er en by i Danmark means �rhus is a city in Denmark

How can I fix this? Basically it needs to allow æ ø å.

Solution

For the regex approach you need the u modifier, which makes PCRE treat both the pattern and the subject string as UTF-8. The i and s you are already using are two other modifiers; for a full list of PHP's modifiers see http://php.net/manual/en/reference.pcre.pattern.modifiers.php.

preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#siu", $webside, $a2 );
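One caveat worth sketching: with the u modifier, preg_match() fails on a subject that is not valid UTF-8, so it can help to check the encoding first. The snippet below is a minimal sketch, not a definitive fix. The sample string stands in for the fetched $webside, the ISO-8859-1 fallback is an assumption about the source page, $fakta is a hypothetical variable name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical sample input; in the question this would be the fetched $webside.
$webside = '<div id="faktaDiv">Århus er en by i Danmark</div>';

// Assumption: if the bytes are not valid UTF-8, the page is ISO-8859-1.
if (!mb_check_encoding($webside, 'UTF-8')) {
    $webside = mb_convert_encoding($webside, 'UTF-8', 'ISO-8859-1');
}

$fakta = null;
$result = preg_match('#<div id=["\']faktaDiv["\']>(.*?)</div>#siu', $webside, $a2);

if ($result === 1) {
    // Index 0 is the whole match, index 1 is the first capture group.
    $fakta = $a2[1];
} elseif ($result === false) {
    // preg_match() returns false on error, e.g. malformed UTF-8 in the subject.
    echo 'PCRE error code: ', preg_last_error(), "\n";
}

echo $fakta, "\n";
```

If the text still shows up garbled in a browser, the page that prints it may also need to declare UTF-8 as its own charset.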

It looks like you are parsing HTML, though, so I'd use DOMDocument to parse that string.

$doc = new DOMDocument();
$doc->loadHTML('<div id="faktaDiv">Test Stuff</div>');
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
    if($div->getAttribute('id') == 'faktaDiv') {
        echo $div->nodeValue;
    }
}
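Since the question is about æ ø å, here is a hedged sketch of the same DOMDocument lookup with UTF-8 input. It assumes the fetched HTML is UTF-8 and has no charset declaration of its own; in that case loadHTML() may fall back to ISO-8859-1, and prepending an XML encoding declaration is a commonly used workaround rather than an official API. $html and $fakta are hypothetical names, and DOMXPath is used here in place of the loop above.

```php
<?php
// Hypothetical UTF-8 input standing in for the fetched page.
$html = '<div id="faktaDiv">Århus er en by i Danmark</div>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Assumption: the input is UTF-8. The prepended declaration hints the
// encoding to the parser so that non-ASCII characters are read as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Select the div by its id attribute.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="faktaDiv"]')->item(0);

// item(0) returns null when nothing matched.
$fakta = $node !== null ? $node->nodeValue : null;
echo $fakta, "\n";
```

The XPath query replaces the manual getAttribute() check; either approach should find the same element.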

To pull the title you should also use a parser, like this:

$doc = new DOMDocument();
$doc->loadHTML('<title>Test Stuff</title>');
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo $title;

As far as I know there should only be one title on a page. If this isn't the case, take off the ->item(0)->nodeValue and loop through the returned DOMNodeList.
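Looping over the node list could look roughly like this. It is a sketch only: the two-title markup is a made-up input, $titles is a hypothetical variable name, and how many title elements survive parsing depends on how the underlying HTML parser handles the malformed page.

```php
<?php
$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML('<html><head><title>First</title><title>Second</title></head><body></body></html>');
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList, which foreach can iterate.
$titles = [];
foreach ($doc->getElementsByTagName('title') as $node) {
    $titles[] = $node->nodeValue;
}

print_r($titles);
```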

PHP Demo: https://eval.in/502432