utf-8 character-encoding character ansi

Converting ANSI to UTF-8 inserts characters before the doctype


Good day,

I'm trying to convert my site from ANSI-encoded PHP files to UTF-8. I converted my header.php and footer.php files to UTF-8 but when I convert my index.php, the page renders incorrectly.

index.php encoded in ANSI:

<?php
include 'header.php';
echo '<h1>ANSI</h1>';
include 'footer.php';
?>

Outputs: http://www.quimp.com/gce/ansi.jpg


index.php encoded in UTF-8 (converted with Notepad++):

<?php
header('Content-Type: text/html; charset=utf-8');

include 'header.php';
echo '<h1>UTF-8</h1>';
include 'footer.php';
?>

Outputs: http://www.quimp.com/gce/utf8.jpg

When I check the source of the page, the output seems correct (the <head> content is where it should be). However, if I copy the source code of the UTF-8 version from the browser and paste it into Notepad++, some extra characters appear before the doctype. They look like a line break and an accent on the "<":

<!DOCTYPE html> // htmlentities() output

%0A%EF%BB%BF%3C%21DOCTYPE+html%3E%0A // urlencode() output

After removing these characters, the page renders properly. The site is www.quimp.com. The content of header.php can be found here: quimp.com/gce/header.txt

I searched a ton but couldn't find a similar problem. Any idea what might cause this?

Thanks a lot for your time! -Ben


Solution

  • It's a BOM (byte order mark).

    UTF-16 files, which can be big-endian or little-endian, often start with a BOM (Unicode character U+FEFF) so that a reader can detect the byte order of the file.

    UTF-8 has no byte-order ambiguity, but some converters insert a BOM anyway. It shows up as 3 bytes at the beginning of the file: EF BB BF, the UTF-8 representation of U+FEFF. That is the %EF%BB%BF in your urlencode() output.

    You didn't say how you're doing the conversion. Whatever tool you're using, see if you can find out how to tell it not to insert the BOM, or find a different tool.

    EDIT: Confirmed, I just took a look at http://quimp.com/gce/header.txt, and it's a UTF-8-encoded file starting with a U+FEFF character.
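
If re-saving the files without a BOM from the editor is not convenient, the three bytes can also be detected and stripped with a short script. The following is only a minimal sketch, assuming PHP 7 or later. The constant and function names (UTF8_BOM, has_utf8_bom, strip_utf8_bom, strip_bom_from_file) are made up for illustration and are not part of PHP or of the site's code.

```php
<?php
// The UTF-8 encoding of U+FEFF (the byte order mark): EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if $bytes begins with a UTF-8 BOM.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: return $bytes without a leading UTF-8 BOM.
// Input that has no BOM is returned unchanged.
function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}

// Hypothetical helper: rewrite the file at $path without its BOM.
// Returns true only if a BOM was found and the file was rewritten.
function strip_bom_from_file(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false || !has_utf8_bom($bytes)) {
        return false;
    }
    return file_put_contents($path, strip_utf8_bom($bytes)) !== false;
}
```

Calling strip_bom_from_file('header.php') once on each converted file should be enough, since the BOM is stored in the file itself. It is worth backing up the files first, because the sketch overwrites them in place.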