
How to get file content with a proper utf-8 encoding using file_get_contents?


I need to get the content of a remote file in utf-8 encoding. The file is in utf-8. When I display that file on screen, it has the proper encoding:

http://www.parfumeriafox.sk/source_file.html

(notice the ň and č characters, for example, these are alright).

When I run this code:

<?php

$url = 'http://parfumeriafox.sk/source_file.html';

$csv = file_get_contents_utf8($url);
header('Content-type: text/html; charset=utf-8');
print $csv;

function file_get_contents_utf8($fn) {
  $content = file_get_contents($fn);
  return mb_convert_encoding($content, 'utf-8');
}

(you can run it at http://www.parfumeriafox.sk/encoding.php), I get question marks instead of those special characters. I have researched this extensively: I have tried the standard file_get_contents function, a PHP stream context, and fopen with fread to read the file at the binary level. Nothing seems to work, with or without sending the header. This is supposed to be perfectly simple, so what am I doing wrong? When I check the string with an encoding-detection function, it returns UTF-8.


Solution

  • How about this one?

    For this one I used header('Content-Type: text/plain; charset=Windows-1250');

    bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn




    This code works for me

    <?php
    header('Content-Type: text/plain;charset=Windows-1250');
    echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
    ?>
    


    The problem is not with file_get_contents()

    I saved $data to a file. The characters were correct, but my text editor still did not decode them correctly. See the image below.

    $data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
    file_put_contents('doc.txt',$data);
    

    UPDATE

    There seems to be one problematic character, which is also visible in the HTML image below. It renders as ¾.

    Its hex value is 0xBE (190 decimal).
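
    To find such bytes without an image, the downloaded data can be checked for UTF-8 validity and scanned for bytes above 0x7F. This is only a minimal sketch: it assumes the mbstring extension is loaded, the helper name non_ascii_bytes is made up for illustration, and $sample stands in for the real download.

```php
<?php
// Hypothetical helper: returns [offset => hex value] for every byte above 0x7F.
function non_ascii_bytes(string $data): array
{
    $found = [];
    $length = strlen($data);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($data[$i]);
        if ($byte > 0x7F) {
            $found[$i] = sprintf('%02x', $byte);
        }
    }
    return $found;
}

// Stand-in for the downloaded data: "levanduľa" with ľ stored as the single byte 0xBE.
$sample = "levandu\xBEa";

// A lone 0xBE is a continuation byte without a lead byte, so this is not valid UTF-8.
var_dump(mb_check_encoding($sample, 'UTF-8'));

// Lists the offset and hex value of each non-ASCII byte.
print_r(non_ascii_bytes($sample));
```

    If the validity check fails while the file still looks right in a browser, the data is most likely in a single-byte encoding rather than UTF-8.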

    I tried these two character sets. Neither worked.

    header('Content-Type: text/plain; charset=ISO-8859-1');
    header('Content-Type: text/plain; charset=ISO-8859-2');
    



    [image: HTML output showing the problematic character]


    END OF UPDATE


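    It works when the header is sent WITHOUT charset=utf-8, presumably because the browser then guesses the encoding itself instead of decoding the bytes as UTF-8.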

    These two headers work

    header('Content-Type: text/plain');
    header('Content-Type: text/html');
    

    These two headers do NOT work

    header('Content-Type: text/plain; charset=utf-8');
    header('Content-Type: text/html; charset=utf-8');
    

    This code was tested and displayed all characters correctly.

    <?php
    header('Content-Type: text/plain');
    echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
    ?>
    


    <?php
    header('Content-Type: text/html');
    echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
    ?>
    




    These are some of the problematic characters with their Hex values.
    This is the saved file viewed in Notepad++ with UTF-8 Encoding.

    [image: the saved file in Notepad++]

    Check the Hex values against these character sets.

    [image: character set table]

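    From the table above, the character set appeared to be Latin 2.

    Wikipedia's "Windows code page" article lists the Windows Latin 2 (Central European) code page as Windows-1250. This is not the same as ISO-8859-2, which is also called Latin-2 and did not work above.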
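
    The table lookup can also be done in code by decoding the same byte under each candidate character set. A rough sketch, assuming the iconv extension is loaded and that the underlying iconv library recognises these charset names; decode_candidates is a hypothetical helper name. Byte 0xBE is ¾ in ISO-8859-1, ž in ISO-8859-2 and ľ in Windows-1250, and only the last one fits "levanduľa".

```php
<?php
// Hypothetical helper: decodes the same raw bytes under each candidate charset.
// Returns [charset => UTF-8 string], or null where the conversion failed.
function decode_candidates(string $bytes, array $charsets): array
{
    $decoded = [];
    foreach ($charsets as $charset) {
        // @ hides the warning iconv raises for unknown charsets or invalid input.
        $result = @iconv($charset, 'UTF-8', $bytes);
        $decoded[$charset] = ($result === false) ? null : $result;
    }
    return $decoded;
}

// Decode the problematic byte 0xBE under the three character sets discussed above.
$candidates = ['ISO-8859-1', 'ISO-8859-2', 'CP1250'];
foreach (decode_candidates("\xBE", $candidates) as $charset => $text) {
    echo $charset, ': ', $text ?? '(conversion failed)', "\n";
}
```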


    bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
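
    If the page has to be served as UTF-8, as in the question, another option is to convert the downloaded bytes instead of changing the header. Note that mb_convert_encoding($content, 'utf-8') without a source encoding assumes the input is already in the internal encoding, so it cannot repair Windows-1250 data. The sketch below uses iconv, since mbstring may not support Windows-1250 in every PHP build. It assumes the source really is Windows-1250 and that the iconv extension is loaded; windows1250_to_utf8 is a made-up helper name.

```php
<?php
// Hypothetical helper: assumes the input bytes really are Windows-1250 (CP1250).
function windows1250_to_utf8(string $bytes): string
{
    $utf8 = iconv('CP1250', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Conversion from Windows-1250 to UTF-8 failed');
    }
    return $utf8;
}

// Stand-in for the downloaded data: "škorica" with š stored as the CP1250 byte 0x9A.
echo windows1250_to_utf8("\x9Akorica"), "\n";
```

    In the question's script, the result of file_get_contents($url) would be passed through this helper before printing, and the charset=utf-8 header could then stay as it is.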