I need to get the content of a remote file in UTF-8 encoding. The file is in UTF-8. When I display that file on screen, it has the proper encoding:
http://www.parfumeriafox.sk/source_file.html
(notice the ň and č characters, for example, these are alright).
When I run this code:
<?php
$url = 'http://parfumeriafox.sk/source_file.html';
$csv = file_get_contents_utf8($url);
header('Content-type: text/html; charset=utf-8');
print $csv;
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'utf-8');
}
(you can run it using http://www.parfumeriafox.sk/encoding.php), then I get question marks instead of those special characters. I have researched this extensively: I have tried the standard file_get_contents function, I have used the PHP stream context functions, and I have also tried fopen and fread to read the file at the binary level. Nothing seems to work. I have tried it with and without sending the header. This is supposed to be perfectly simple, so what am I doing wrong? When I check the string with an encoding detection function, it returns UTF-8.
How about this one?
For this one I used header('Content-Type: text/plain; charset=Windows-1250');
bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
This code works for me
<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
The problem is not with file_get_contents().
I saved $data to a file and the bytes were correct, but my text editor still did not display them correctly. See image below.
$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);
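To look at the raw bytes without relying on how an editor renders them, something like the following could be used. This is only a sketch: the function names non_ascii_bytes_hex() and is_valid_utf8() are made up for illustration, and the sample string stands in for the downloaded $data.

```php
<?php
// Hypothetical helper: list the distinct non-ASCII bytes in a string as hex.
function non_ascii_bytes_hex(string $data): array
{
    // No /u modifier, so the pattern matches single bytes, not UTF-8 characters.
    preg_match_all('/[\x80-\xFF]/', $data, $matches);
    return array_values(array_unique(array_map('bin2hex', $matches[0])));
}

// Hypothetical helper: an empty pattern with /u only matches valid UTF-8 subjects.
function is_valid_utf8(string $data): bool
{
    return preg_match('//u', $data) === 1;
}

// Sample bytes standing in for the downloaded file (an assumption, not the real file).
$sample = "levandu\xBEa, \x9Akorica";
print_r(non_ascii_bytes_hex($sample));
var_dump(is_valid_utf8($sample));
```

If the file really were UTF-8, every accented character would show up as a multi-byte sequence and the validity check would pass. Lone bytes such as be or 9a point to a single-byte code page instead.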
There seems to be one problematic character, as shown here. It can also be seen in the HTML image below, where it renders as ¾.
Its hex value is 0xBE (190 decimal).
I tried these two character sets. Neither worked.
header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');
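The same byte can be checked against several candidate character sets in code rather than by changing the header each time. A minimal sketch, assuming the iconv extension is available; decode_as() is a hypothetical helper name, not something from the original code:

```php
<?php
// Hypothetical helper: interpret $bytes in $charset and return the result as UTF-8.
function decode_as(string $bytes, string $charset): string
{
    $utf8 = iconv($charset, 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot decode bytes as $charset");
    }
    return $utf8;
}

// The problematic byte from the file.
$byte = "\xBE";

foreach (['ISO-8859-1', 'ISO-8859-2', 'Windows-1250'] as $charset) {
    // Print the candidate charset and the character the byte maps to.
    echo $charset, ': ', decode_as($byte, $charset), "\n";
}
```

In the standard code tables, 0xBE is ¾ in ISO-8859-1, ž in ISO-8859-2 and ľ in Windows-1250. Only the last one fits the word levanduľa.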
END OF UPDATE
It works by adding a header WITHOUT charset=utf-8.
These two headers work
header('Content-Type: text/plain');
header('Content-Type: text/html');
These two headers do NOT work
header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');
This code was tested and displayed all characters correctly.
<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
These are some of the problematic characters with their hex values.
This is the saved file viewed in Notepad++ with UTF-8 encoding.
Check the hex values against these character sets.
From the above table I saw that the character set was Latin2.
I checked the Wikipedia article on Windows code pages and found that the Windows Latin 2 (Central European) code page is Windows-1250.
bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
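If the goal is still UTF-8 output, as in the question, the bytes could be converted once the source encoding is known instead of changing the header. This is a sketch under the assumption that the file is Windows-1250 and that the iconv extension is installed; windows1250_to_utf8() is a hypothetical name.

```php
<?php
// Hypothetical helper: convert Windows-1250 bytes to UTF-8.
function windows1250_to_utf8(string $bytes): string
{
    $utf8 = iconv('Windows-1250', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Conversion from Windows-1250 failed');
    }
    return $utf8;
}

// Sample bytes standing in for the result of file_get_contents($url).
$raw = "levandu\xBEa, \x9Akorica, hru\x9Aka";

// In a web script, header('Content-Type: text/plain; charset=utf-8') would be sent before this output.
echo windows1250_to_utf8($raw), "\n";
```

The source encoding is passed explicitly here. The original mb_convert_encoding($content, 'utf-8') call names only the target encoding, so it cannot know that the input is Windows-1250.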