If you have data with the same endianness as the system, can you simply 1-1 map bytes into an integer type?


I am writing a class which lets me convert between bytes and the various integer data types. Instead of reversing arrays and then converting data, I have opted to determine if the endianness of the system is the same as the data. If it is, I simply map the data to the integer, like this in the case of a 64-bit integer:

result = (long)(
    (buffer[index] << 56) |
    (buffer[index + 1] << 48) |
    (buffer[index + 2] << 40) |
    (buffer[index + 3] << 32) |
    (buffer[index + 4] << 24) |
    (buffer[index + 5] << 16) |
    (buffer[index + 6] << 8) |
    (buffer[index + 7]));

And if the endianness of the system and data differ, it would be reversed like so:

result = (long)(
    (buffer[index]) |
    (buffer[index + 1] << 8) |
    (buffer[index + 2] << 16) |
    (buffer[index + 3] << 24) |
    (buffer[index + 4] << 32) |
    (buffer[index + 5] << 40) |
    (buffer[index + 6] << 48) |
    (buffer[index + 7] << 56));

result is a 64-bit signed integer

buffer is a byte array

index is a 32-bit signed integer indicating the position in the buffer to begin reading

My question is... am I doing this wrong or is this just a really simple way to do the conversion without having to reverse the array in place or make copies?

This seems like it should work for all combinations of system and data endianness and convert between the two correctly.

Is there perhaps a different way that is easier to read or generally simpler?


Solution

  • There are two main scenarios when converting between integers and their byte representation:

    Native endianness

    This is typically the case when interoperating with native code. Use APIs that naturally work in native endianness, such as Buffer.BlockCopy, BitConverter.GetBytes/ToInt64, and unsafe code. In some cases the P/Invoke marshaller can do most of the work for you.
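
As a rough illustration of the native-endianness case, the sketch below round-trips a value through BitConverter, which always uses the byte order of the machine it runs on. The class and method names (NativeEndianDemo, RoundTrip, FirstByteOfOne) are made up for this example.

```csharp
using System;

public static class NativeEndianDemo
{
    // Writes the value in the machine's own byte order and reads it back.
    // Both calls use native endianness, so the result equals the input
    // on any system.
    public static long RoundTrip(long value)
    {
        byte[] bytes = BitConverter.GetBytes(value);
        return BitConverter.ToInt64(bytes, 0);
    }

    // First byte of the native representation of 1L:
    // 1 on a little-endian system, 0 on a big-endian one.
    public static byte FirstByteOfOne()
    {
        return BitConverter.GetBytes(1L)[0];
    }
}
```

The byte layout produced here differs between machines, which is why this approach suits in-process or native interop data rather than files and network packets.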

    Fixed endianness

    This is typically the case when parsing files or network protocols. In that case your code snippets (minus the casting bug: each byte must be cast to long before shifting, because C# otherwise performs the shift on a 32-bit int) are the ideal way to handle it. Give them names that mention the endianness, such as ToInt64BigEndian.

    They are easy to understand, easy to test (don't depend on system endianness) and reasonably fast.
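
A minimal sketch of how the question's two snippets might look as named methods, with each byte widened to long before it is shifted. In C#, byte operands are promoted to int, and an int shift uses only the low five bits of the count, so `buffer[index] << 56` on its own behaves like `<< 24`. The class name EndianReader is an assumption made for this example.

```csharp
using System;

public static class EndianReader
{
    // Reads 8 bytes starting at index, most significant byte first.
    public static long ToInt64BigEndian(byte[] buffer, int index)
    {
        return
            ((long)buffer[index]     << 56) |
            ((long)buffer[index + 1] << 48) |
            ((long)buffer[index + 2] << 40) |
            ((long)buffer[index + 3] << 32) |
            ((long)buffer[index + 4] << 24) |
            ((long)buffer[index + 5] << 16) |
            ((long)buffer[index + 6] << 8)  |
             (long)buffer[index + 7];
    }

    // Reads 8 bytes starting at index, least significant byte first.
    public static long ToInt64LittleEndian(byte[] buffer, int index)
    {
        return
             (long)buffer[index]            |
            ((long)buffer[index + 1] << 8)  |
            ((long)buffer[index + 2] << 16) |
            ((long)buffer[index + 3] << 24) |
            ((long)buffer[index + 4] << 32) |
            ((long)buffer[index + 5] << 40) |
            ((long)buffer[index + 6] << 48) |
            ((long)buffer[index + 7] << 56);
    }
}
```

For the bytes 01 02 03 04 05 06 07 08, ToInt64BigEndian returns 0x0102030405060708 and ToInt64LittleEndian returns 0x0807060504030201, whatever the endianness of the machine running the code.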

    Occasionally, Buffer.BlockCopy or unsafe reinterpret casting can give a performance boost, but I'd only use those after profiling indicates a bottleneck in this code. In my programs this has never been a bottleneck, so I use code pretty similar to your examples.
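
For reference, the Buffer.BlockCopy variant could look roughly like the sketch below, which converts a whole run of 8-byte groups in one call. It interprets the bytes in the machine's native order, so it only fits data that is already in that order. The names BulkNativeCopy and ToInt64ArrayNative are hypothetical.

```csharp
using System;

public static class BulkNativeCopy
{
    // Copies count * 8 bytes, starting at byte offset index, into a new
    // long[] without any per-element shifting. Offsets and lengths passed
    // to Buffer.BlockCopy are measured in bytes, not array elements.
    public static long[] ToInt64ArrayNative(byte[] buffer, int index, int count)
    {
        long[] result = new long[count];
        Buffer.BlockCopy(buffer, index, result, 0, count * sizeof(long));
        return result;
    }
}
```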

    I don't like reversal-based code for this, since the code path for big-endian systems won't get exercised on a typical little-endian system.


    ErrataRob's code review of Silent Circle makes a similar point, elaborating a bit more:

    Protocol parsing is CPU independent. There is never a reason to do something different depending upon the CPU.

    Casting and byte-swapping

    The mistake of doing an #if conditional above comes from trying to fix an underlying mistake of casting between char* and int*. This is a common technique taught in your “UNIX Network Programming” class. It’s also wrong. You should never do it when parsing packets.

    There are two reasons to avoid this. The first is that (as mentioned above) some CPUs, such as SPARC and some versions of ARM crash when referencing unaligned integers. This makes network code unstable on RISC systems, because most integers are usually aligned anyway, meaning a lot of alignment issues escape undetected into shipping code. The only way to make stable code is to stop casting integers in network (or file) parsers.

    The second problem is that it causes confusion with byte-order/endianess that doesn’t happen if you just don’t cast integers. Consider the IP address “10.1.2.3”. There are only two forms for this number, either an integer with the value of 0x0a010203, or an array of bytes with the value 0a 01 02 03. The problem is that little endian machines are weird. The integer 0x0a010203 is represented internally as 03 02 01 0a on x86 processors, with the order of bytes “swapped”.

    But this is just an internal detail that YOU NEVER NEED TO WORRY ABOUT. As long as you never cross the streams and cast from a char* to an int* (or the reverse), then the byte-order/endianness never matters.
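
The IP address example from the quote, translated to C# as a small sketch (not part of the quoted review; the names IpExample and ToUInt32BigEndian are assumptions for illustration). The four bytes are combined with shifts, so the CPU's own byte order never enters the picture:

```csharp
using System;

public static class IpExample
{
    // Combines four bytes, most significant first, into a 32-bit value.
    // Each byte is widened to uint before shifting so the result is unsigned.
    public static uint ToUInt32BigEndian(byte[] buffer, int index)
    {
        return
            ((uint)buffer[index]     << 24) |
            ((uint)buffer[index + 1] << 16) |
            ((uint)buffer[index + 2] << 8)  |
             (uint)buffer[index + 3];
    }
}
```

Given the byte array 0a 01 02 03, this returns 0x0a010203 on both little-endian and big-endian machines. By contrast, BitConverter.GetBytes(0x0a010203u) exposes the internal layout and yields 03 02 01 0a on x86.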