The following problem confused me a lot:
I was experimenting with how doubles, and especially their 'special' values like PositiveInfinity, are stored in a file, which was no problem. I did this in three simple steps: creating a double; writing it into a file; reading the file into a byte array. This was quite easy, and now I know what a Double.NaN looks like in binary format :)
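Roughly, those three steps looked like this (a minimal sketch; "double.bin" is just an example file name):

    using System;
    using System.IO;

    // Step 1: create a double (Double.NaN here), step 2: write it to a file,
    // step 3: read the file back into a byte array.
    double value = Double.NaN;
    File.WriteAllBytes("double.bin", BitConverter.GetBytes(value));

    byte[] bytes = File.ReadAllBytes("double.bin");
    Console.WriteLine(BitConverter.ToString(bytes));
    // prints 00-00-00-00-00-00-F8-FF here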
But then I came across the following:
According to the .NET Framework there is a NegativeZero:
internal static double NegativeZero = BitConverter.Int64BitsToDouble(unchecked((long)0x8000000000000000));
The way it is represented is quite simple (following IEEE 754):
The long represents the binary number 10000000... The first bit says that the double is negative. So what you get for NegativeZero is -0, as the mantissa and the exponent are both 0.
Representing a 'normal' 0 would then be 64 bits all set to 0.
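This is easy to check on the bit level (a small sketch; Convert.ToString with base 2 is only used to make the bit patterns visible):

    using System;

    long negativeZeroBits = BitConverter.DoubleToInt64Bits(-0.0);
    long positiveZeroBits = BitConverter.DoubleToInt64Bits(0.0);

    Console.WriteLine(Convert.ToString(negativeZeroBits, 2).PadLeft(64, '0'));
    // 1000000000000000000000000000000000000000000000000000000000000000
    Console.WriteLine(Convert.ToString(positiveZeroBits, 2).PadLeft(64, '0'));
    // 0000000000000000000000000000000000000000000000000000000000000000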
But the problem is reading those numbers into a byte array. What I expected for the NegativeZero was: 128 0 0 ... [binary: 10000000 00000000 ...]
But in fact it was the other way round: 0 0 ... 128! [binary: 00000000 ... 10000000]
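A quick way to reproduce this without the file round-trip (a sketch; BitConverter.GetBytes just copies the bytes as they sit in memory):

    using System;

    double negativeZero = BitConverter.Int64BitsToDouble(unchecked((long)0x8000000000000000));
    byte[] bytes = BitConverter.GetBytes(negativeZero);

    Console.WriteLine(string.Join(" ", bytes));
    // prints: 0 0 0 0 0 0 0 128 -- not 128 0 0 0 0 0 0 0 as I expected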
My first thought was: 'Maybe File.ReadAllBytes() returns everything in the wrong order (which would be awkward)'. So I decided to test the reader with a string (i.e. created a file containing a string and read it into a byte array).
The result was just fine: 'Hello' was still 'Hello' in the byte array, and not 'olleH' as the example above would suggest.
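The string test, roughly (a sketch; "hello.txt" is just an example, and I'm assuming the default UTF-8 encoding of File.WriteAllText):

    using System;
    using System.IO;

    File.WriteAllText("hello.txt", "Hello");   // written as UTF-8 by default
    byte[] bytes = File.ReadAllBytes("hello.txt");

    Console.WriteLine(string.Join(" ", bytes));
    // prints: 72 101 108 108 111 -- 'H' 'e' 'l' 'l' 'o', in order, not reversed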
Again, everything in a nutshell:
Writing a binary number (10000000 00000000 00000000) to a file works fine.
Reading the same binary number into a byte array turns out to be:
[0]00000000
[1]00000000
[2]10000000
Reading the file can't be the problem, as strings stay the same.
BUT: Interpreting the byte array back to the original variable (long, double...) returns the correct result.
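For example, feeding the 'reversed-looking' bytes back through BitConverter gives the original value again (a sketch, assuming the bytes are read back on the same machine that wrote them):

    using System;

    byte[] bytes = { 0, 0, 0, 0, 0, 0, 0, 128 };   // what I actually read from the file
    double value = BitConverter.ToDouble(bytes, 0);

    long bits = BitConverter.DoubleToInt64Bits(value);
    Console.WriteLine(bits == unchecked((long)0x8000000000000000));
    // True -- the round trip is consistent, even though the byte order surprised me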
So from my point of view it looks like the bytes of a variable are stored in the wrong order.
Is this true? And if so, why is it done like this? From my point of view it seems to violate IEEE 754 (although it obviously works).
And please correct me if I'm missing anything here, as I am still confused after hours of searching for an answer to this problem...
There is no universal rule about the order in which the bytes of a multi-byte value should be stored.
The little-endian approach would put the four-byte number 0x01020304 into bytes in the order 0x04, 0x03, 0x02, 0x01.
The big-endian approach would put the same four-byte number into bytes in the order 0x01, 0x02, 0x03, 0x04.
Neither of these is correct or incorrect, though obviously a system using one approach needs some conversion to interoperate with a system using the other.
(There are even strange combinations like 0x03, 0x04, 0x01, 0x02 or 0x02, 0x01, 0x04, 0x03, but they are much rarer, and generally come about from something treating 4-byte values as two 2-byte units with a big-endian order within each unit and a little-endian order between the units, or vice versa.)
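You can see both common orders from C# directly (a sketch; BinaryPrimitives lives in System.Buffers.Binary, needs a reasonably recent .NET, and is only used here to show the big-endian form):

    using System;
    using System.Buffers.Binary;

    int number = 0x01020304;

    byte[] native = BitConverter.GetBytes(number);       // the machine's own byte order
    Console.WriteLine(BitConverter.ToString(native));    // 04-03-02-01 on a little-endian machine

    byte[] big = new byte[4];
    BinaryPrimitives.WriteInt32BigEndian(big, number);   // explicitly big-endian, whatever the machine is
    Console.WriteLine(BitConverter.ToString(big));       // 01-02-03-04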
If you are doing .NET you are probably using an Intel chip or one compatible with it, and they use little-endian order for storing values in memory. Copying directly from memory to file or back will result in a little-endian file.
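If you need a fixed byte order in a file regardless of which machine writes it, you can check and convert explicitly (a sketch, one way of many to do it):

    using System;

    byte[] bytes = BitConverter.GetBytes(-0.0);    // native order: 00 ... 00 80 on Intel

    if (BitConverter.IsLittleEndian)
        Array.Reverse(bytes);                      // flip to big-endian if that's what the file format expects

    Console.WriteLine(BitConverter.ToString(bytes));
    // 80-00-00-00-00-00-00-00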
Now, a string is a sequence of characters, and its in-memory representation is a sequence of bytes in some order. As such, with "Hello" we will have some sort of representation of H followed by e followed by l and so on.
This will be the case whether the system is little-endian or big-endian.
If, however, the representation of one of those characters is not a single byte, then that representation may be affected by endianness.
The most common modern representation for file use (and really the only one to use 99% of the time) is UTF-8. UTF-8 will define multi-byte sequences for characters with a code-point above U+007F, but the order of that sequence is defined by UTF-8 itself, and so is not affected by endianness.
The second most common modern representation (and the one to use for the remaining 1% of the time, if you've a good reason to) is UTF-16. UTF-16 deals with characters as 16-bit units, or as two 16-bit units for characters above U+FFFF. In the case of two 16-bit units being used, the order of those units is specified by UTF-16 itself. However, the order of the two octets representing each 16-bit unit is not specified at this level, and is hence affected by endianness.
Hence UTF-16 can be represented in bytes as either UTF-16LE or UTF-16BE or as one or the other with a byte-order-mark at the start of a file to let reading software determine which is in use. As such, with UTF-16 "hello" could be:
0x00 0x68 0x00 0x65 0x00 0x6C 0x00 0x6C 0x00 0x6F
or it could be:
0x68 0x00 0x65 0x00 0x6C 0x00 0x6C 0x00 0x6F 0x00
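You can reproduce exactly this with the built-in encodings (a sketch; Encoding.Unicode is UTF-16LE and Encoding.BigEndianUnicode is UTF-16BE):

    using System;
    using System.Text;

    string s = "hello";

    Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));
    // 00-68-00-65-00-6C-00-6C-00-6F   (UTF-16BE)
    Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));
    // 68-00-65-00-6C-00-6C-00-6F-00   (UTF-16LE)
    Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));
    // 68-65-6C-6C-6F                  (UTF-8: one byte per character here, so no endianness issue)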