Currently I'm trying to read a file in MIME format which contains the binary data of a PNG.
On Windows, reading the file gives me the proper binary string, meaning I can just copy the string over, change the extension to .png, and see the picture.
An example after reading the file in Windows is below:
--fh-mms-multipart-next-part-1308191573195-0-53229
Content-Type: image/png;name=app_icon.png
Content-ID: "<app_icon>"
content-location: app_icon.png
‰PNG
etc...etc...
An example after reading the file in Linux is below:
--fh-mms-multipart-next-part-1308191573195-0-53229
Content-Type: image/png;name=app_icon.png
Content-ID: "<app_icon>"
content-location: app_icon.png
�PNG
etc...etc...
I am not able to convert the Linux version into a picture, as it all becomes funky symbols with a lot of upside-down "?" and "1/2" symbols.
Can anyone enlighten me on what is going on and maybe provide a solution? I've been playing with the code for more than a week now.
� is a sequence of three characters, 0xEF 0xBF 0xBD, and is the UTF-8 representation of the Unicode codepoint 0xFFFD. The codepoint itself is the replacement character for illegal UTF-8 sequences.
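You can see this for yourself with a couple of lines of Java (the class and variable names below are just for illustration):

    import java.nio.charset.StandardCharsets;

    public class ReplacementCharDemo {
        public static void main(String[] args) {
            // U+FFFD is the Unicode replacement character; encode it as UTF-8.
            byte[] utf8 = "\uFFFD".getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                System.out.printf("0x%02X ", b);   // prints 0xEF 0xBF 0xBD
            }
            System.out.println();
        }
    }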
Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header incorrectly. The PNG header starts with the byte 0x89 (and is followed by 0x50, 0x4E, 0x47), which is handled correctly on Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the byte 0x89 is displayed as ‰.
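A quick sketch of that Windows-side behaviour (the class name is made up, and your console must be able to display CP1252 characters for the output to look right):

    import java.nio.charset.Charset;

    public class PngHeaderDemo {
        public static void main(String[] args) {
            // First four bytes of every PNG file: 0x89 'P' 'N' 'G'
            byte[] pngHeader = {(byte) 0x89, 0x50, 0x4E, 0x47};

            // In CP1252 the byte 0x89 maps to the per-mille sign '‰',
            // so the header stays recognisable.
            String asCp1252 = new String(pngHeader, Charset.forName("windows-1252"));
            System.out.println(asCp1252);   // ‰PNG
        }
    }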
On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was a good idea to process the file as a UTF-8 sequence). Since 0x89 on its own is not a valid codepoint in the ASCII-7 range (see the UTF-8 encoding scheme), it cannot be mapped to a valid UTF-8 codepoint in the 0x00-0x7F range. It also cannot be mapped to a valid codepoint represented as a multi-byte UTF-8 sequence, because all multi-byte sequences start with a lead byte whose first two bits are set to 1 (11....), and since this is the start of the file, it cannot be a continuation byte either. The resulting behavior is that the UTF-8 decoder replaces 0x89 with the UTF-8 replacement-character bytes 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), which will be displayed in ISO-8859-1 as ï¿½.
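Here is a minimal sketch of what that decoder does to the header (the class name is made up; the substitution shown is the default replace behaviour of Java's UTF-8 charset):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf8MangleDemo {
        public static void main(String[] args) {
            byte[] pngHeader = {(byte) 0x89, 0x50, 0x4E, 0x47};

            // 0x89 is not valid as a lead byte or a lone byte in UTF-8,
            // so the decoder substitutes U+FFFD for it.
            String decoded = new String(pngHeader, StandardCharsets.UTF_8);
            System.out.println(decoded.charAt(0) == '\uFFFD');    // true

            // Re-encoding never gives 0x89 back - the original byte is gone for good.
            byte[] roundTripped = decoded.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(roundTripped));    // [-17, -65, -67, 80, 78, 71]
        }
    }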
If you need to resolve this problem, you'll need to ensure the following in Linux:

* Read and write the file's contents as a sequence of bytes, and not as characters. Apparently, the Java Runtime will perform decoding of the byte sequence to UTF-16 codepoints if you convert a sequence of bytes to a character or a String object (see the sketch after this list).
* Avoid viewing or re-saving the file in an editor/viewer that uses a character encoding in which the codepoint 0xFFFD (it is actually the diamond character �) cannot be represented, as that might result in further changes (unlikely, but you never know how the editor/viewer has been written).
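As a sketch of the first point, something along these lines should work. The file names (message.mime, app_icon.png) and the "copy everything from the PNG signature to the end of the file" shortcut are assumptions; a real MIME parser would stop at the next boundary line instead:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class ExtractPng {
        // PNG signature: 0x89 'P' 'N' 'G' CR LF SUB LF
        private static final byte[] PNG_SIGNATURE = {
                (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
        };

        public static void main(String[] args) throws IOException {
            // Hypothetical paths - substitute your own input and output files.
            Path mimeFile = Paths.get("message.mime");
            Path pngFile = Paths.get("app_icon.png");

            // Read the whole file as raw bytes; no charset is involved at any point.
            byte[] raw = Files.readAllBytes(mimeFile);

            int start = indexOf(raw, PNG_SIGNATURE);
            if (start < 0) {
                throw new IOException("No PNG signature found in " + mimeFile);
            }

            // Naive end-of-part handling: keep everything from the signature to the
            // end of the file (a real parser would stop at the next MIME boundary).
            Files.write(pngFile, Arrays.copyOfRange(raw, start, raw.length));
        }

        // Index of the first occurrence of 'pattern' in 'data', or -1 if absent.
        private static int indexOf(byte[] data, byte[] pattern) {
            outer:
            for (int i = 0; i <= data.length - pattern.length; i++) {
                for (int j = 0; j < pattern.length; j++) {
                    if (data[i + j] != pattern[j]) continue outer;
                }
                return i;
            }
            return -1;
        }
    }

The key point is that the bytes go straight from the input file to the output file; nothing is ever converted to a char or a String, so the decoder never gets a chance to mangle 0x89.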