Tags: php, regex, preg-match, file-get-contents

preg_match and file_get_contents and æ ø å


I have a question about preg_match. If I try to fetch text like this: "Århus er en by i Danmark means Århus is a city in Denmark", using:

preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#si", $webside, $a2 );

echo $a2[1];

Then the output will be:

�rhus er en by i Danmark means �rhus is a city in Denmark

How can I fix this? Basically it needs to allow æ ø å.


Solution

  • For the regex approach you need the u modifier, which makes PCRE treat both the pattern and the subject as UTF-8. The i and s you are already using are two other modifiers; for the full list of PHP's modifiers see http://php.net/manual/en/reference.pcre.pattern.modifiers.php.

    preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#siu", $webside, $a2 );
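
    One caveat worth sketching: with the u modifier, preg_match() returns false when the subject is not valid UTF-8. The example below assumes (the question does not say this) that the fetched page is ISO-8859-1, uses a hard-coded string in place of $webside, and relies on the mbstring extension for mb_convert_encoding(). Treat it as an illustration under those assumptions, not a diagnosis of the actual page.

    ```php
    <?php
    // Stand-in for the fetched page. Assumption: ISO-8859-1 bytes, where "\xC5" is "Å".
    $webside = "<div id=\"faktaDiv\">\xC5rhus er en by i Danmark</div>";
    $pattern = "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#siu";

    // With the u modifier, a subject that is not valid UTF-8 makes preg_match() return false.
    $found = preg_match($pattern, $webside, $a2);

    if ($found === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        // Convert to UTF-8 (the source encoding is an assumption) and try again.
        $webside = mb_convert_encoding($webside, 'UTF-8', 'ISO-8859-1');
        $found = preg_match($pattern, $webside, $a2);
    }

    if ($found === 1) {
        // $a2[0] is the whole match, $a2[1] is the first capture group.
        echo $a2[1];
    }
    ```

    The echoed text will only display correctly if the page or terminal showing it is also using UTF-8.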
    

    It looks like you are parsing HTML, though, so I'd use DOMDocument to parse that string instead.

    $doc = new DOMDocument();
    $doc->loadHTML('<div id="faktaDiv">Test Stuff</div>');
    $divs = $doc->getElementsByTagName('div');
    foreach($divs as $div) {
        if($div->getAttribute('id') == 'faktaDiv') {
            echo $div->nodeValue;
        }
    }
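
    Since the question is about æ ø å, note that loadHTML() may not treat the string as UTF-8 unless the markup declares a charset. A minimal sketch of one possible workaround, assuming the input is UTF-8 and the mbstring extension is available: convert non-ASCII characters to numeric entities before parsing. The sample string and the variable names $html, $safeHtml and $text are made up for illustration.

    ```php
    <?php
    // Stand-in for the fetched page, as UTF-8 bytes ("\xC3\x85" is "Å").
    $html = "<div id=\"faktaDiv\">\xC3\x85rhus er en by i Danmark</div>";

    // Turn every non-ASCII character into a numeric entity so the parser
    // does not have to guess the encoding. Map: [start, end, offset, mask].
    $safeHtml = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($safeHtml);

    // Select the div by id with XPath instead of looping over every div.
    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="faktaDiv"]')->item(0);

    // DOM text values are returned as UTF-8.
    $text = $node !== null ? $node->nodeValue : '';
    echo $text;
    ```

    The XPath lookup is a design choice for the sketch only; the loop over getElementsByTagName('div') shown above works just as well.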
    

    To pull the title, use a parser in the same way:

    $doc = new DOMDocument();
    $doc->loadHTML('<title>Test Stuff</title>');
    $title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
    echo $title;
    

    As far as I know there should only be one title on a page. If that isn't the case, take off the ->item(0)->nodeValue and loop through the returned DOMNodeList.

    PHP Demo: https://eval.in/502432