
On Linux, the command file -i returns the wrong value charset=unknown-8bit for a windows-1252 encoded file


I am using Node.js and iconv-lite to serve an XML file encoded as windows-1252 in an HTTP response, but the file -i command cannot identify the downloaded file as windows-1252.

Server side:

const ICONVLITE = require('iconv-lite');

// r is the HTTP response object (assumed here to be an Express-style response)
r.header('Content-Disposition', 'attachment; filename=teste.xml');
r.header('Content-Type', 'text/xml; charset=iso8859-1');
r.write(ICONVLITE.encode(`<?xml version="1.0" encoding="windows-1252"?><x>€Àáção</x>`, "win1252")); // euro symbol and Portuguese accented vowels
r.end();

The browser downloads the file, and then I check it in Ubuntu 20.04 LTS:

file -i teste.xml
/tmp/teste.xml: text/xml; charset=unknown-8bit

When I open it in gedit, the accented vowels appear fine, but the euro symbol does not (all characters from 128 to 159 get messed up).

I checked in a Windows 10 VM, and there everything works. Web browsers on both Windows and Linux also display everything correctly.

So, is this a problem with the file command? How can I check the correct charset of a file on Linux?

Thank you

EDIT The resulting file can be downloaded here

2nd EDIT I found one error! The code line:

    r.header('Content-Type', 'text/xml; charset=iso8859-1');

must be:

r.header('Content-Type', 'text/xml; charset=Windows-1252');

Solution

  • It's important to understand what a character encoding is and isn't.

    A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.

    For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000 (byte value 128 in decimal, 0x80 in hexadecimal). That same string of bits will mean other things in other encodings - most single-byte encodings assign some meaning to nearly all 256 possible byte values.

    If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.

    Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at that file can see that it contains the string of bits 10000000, but it doesn't know what mapping table to look that up in. Nothing you do in the HTTP headers is going to change that - none of those are going to affect how it's saved on disk.

    In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so it outputs "unknown-8bit".
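
To make the reasoning above concrete, here is a minimal Node.js sketch of that kind of guessing, together with a check of the XML declaration. It is a simplified illustration, not the actual algorithm of the file command. The function names (isAscii, isValidUtf8, guessCharset, xmlDeclaredEncoding) are hypothetical, and the rule that bytes 128 to 159 rule out ISO-8859-1 is an assumption made for the sketch. The sample bytes are written out by hand to match what the Windows-1252 encoding of the XML in the question should look like.

```javascript
// Simplified charset guessing; NOT the real logic of file(1).
// All function names below are hypothetical.

// True if every byte is in the 7-bit ASCII range.
function isAscii(buf) {
  return buf.every((b) => b < 0x80);
}

// True if the bytes are well-formed UTF-8. Node replaces invalid
// sequences with U+FFFD when decoding, so a decode/encode round trip
// only reproduces the input if the input was valid.
function isValidUtf8(buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

// Guess from the bytes alone, with no label available.
function guessCharset(buf) {
  if (isAscii(buf)) return 'us-ascii';
  if (isValidUtf8(buf)) return 'utf-8';
  // Assumption: bytes 0x80-0x9F are control codes in ISO-8859-1 but
  // printable in Windows-1252, so their presence rules out ISO-8859-1
  // and leaves several 8-bit encodings as candidates.
  const hasC1 = buf.some((b) => b >= 0x80 && b <= 0x9f);
  return hasC1 ? 'unknown-8bit' : 'iso-8859-1';
}

// Read the encoding named in the XML declaration, if there is one.
function xmlDeclaredEncoding(buf) {
  // latin1 maps every byte to one character, so this never fails.
  const head = buf.subarray(0, 200).toString('latin1');
  const m = /^<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  return m ? m[1].toLowerCase() : null;
}

// The XML from the question as Windows-1252 bytes:
// € = 0x80, À = 0xC0, á = 0xE1, ç = 0xE7, ã = 0xE3, o = 0x6F
const sample = Buffer.concat([
  Buffer.from('<?xml version="1.0" encoding="windows-1252"?><x>', 'latin1'),
  Buffer.from([0x80, 0xc0, 0xe1, 0xe7, 0xe3, 0x6f]),
  Buffer.from('</x>', 'latin1'),
]);

console.log(guessCharset(sample));        // unknown-8bit
console.log(xmlDeclaredEncoding(sample)); // windows-1252
```

The byte-level guess cannot do better than "unknown-8bit" for this file, while the declaration inside the document names the encoding directly. For XML files, reading that declaration is likely the more reliable check.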