Moving binary files between big endian and little endian platforms


I know that network byte order is big endian regardless of host endianness.

My question is what happens when a binary file is moved from a BE to an LE platform. From this post I can see that the byte order of data on disk is the same as the in-memory byte order of the platform: https://stackoverflow.com/a/5751824

So assume I have a small binary file on an LE machine containing the two bytes 2 and 5, in that order (note: I created an actual binary file with xxd):

00000010 00000101
  • Use wget to download this file on a BE machine.
  • Check the file content and it's still 00000010 00000101.

Didn't the file have to be reversed to 00000101 00000010 before transmission? If so, wouldn't the BE machine store it as is, in network order, after receiving it? How is the file content not reversed after download?


Solution

  • My question is what happens when a binary file is moved from a BE to LE platform.

    The file's bytes are copied without being swapped at all: they appear in the same order on every machine when enumerated byte by byte.

    For simple text, byte order doesn't matter, since sequences of individual bytes have no endianness.
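As a rough illustration of the point above, the following sketch writes the two bytes from the question to a file, reads them back, and then interprets them as a 16-bit integer in both byte orders. The file name `sample.bin` is a placeholder chosen for this example, and the temporary directory only stands in for a copy or download.

```python
import os
import tempfile

# The two bytes from the question: 00000010 00000101
payload = bytes([0b00000010, 0b00000101])

# Write the file and read it back; a plain copy or download behaves the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()

# The byte sequence is the same on every platform.
print(list(data))  # [2, 5]

# Only the interpretation as a 16-bit integer depends on the chosen byte order.
as_big = int.from_bytes(data, "big")        # 0x0205
as_little = int.from_bytes(data, "little")  # 0x0502
print(as_big, as_little)  # 517 1282
```

The bytes themselves never change; endianness only comes into play once a reader decides to treat several of them as one number.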

    For almost anything else, there will be a defined file format, which specifies the endianness of numeric fields larger than 8 bits.  This is what you have already observed about TCP/IP, which defines big endian for the multi-byte fields in its headers; the payload, such as your file, is carried as an opaque sequence of bytes.
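One way to picture a format that fixes its own byte order is Python's `struct` module, where the leading `>` or `<` in the format string states the endianness explicitly. The record layout below (a 16-bit ID followed by a 32-bit length) is invented for this sketch and does not correspond to any real file format.

```python
import struct

# Hypothetical record layout: 16-bit unsigned ID, then 32-bit unsigned length.
BIG_ENDIAN_RECORD = struct.Struct(">HI")     # this format declares big endian
LITTLE_ENDIAN_RECORD = struct.Struct("<HI")  # this format declares little endian


def encode_record(record_id: int, length: int, layout: struct.Struct) -> bytes:
    """Serialize the fields using the byte order fixed by the layout."""
    return layout.pack(record_id, length)


def decode_record(raw: bytes, layout: struct.Struct) -> tuple:
    """Parse the fields using the byte order fixed by the layout."""
    return layout.unpack(raw)


be = encode_record(0x0205, 1, BIG_ENDIAN_RECORD)
le = encode_record(0x0205, 1, LITTLE_ENDIAN_RECORD)
print(be.hex())  # 020500000001
print(le.hex())  # 050201000000

# Reading with the layout the format specifies gives the same values on any host.
print(decode_record(be, BIG_ENDIAN_RECORD))     # (517, 1)
print(decode_record(le, LITTLE_ENDIAN_RECORD))  # (517, 1)
```

Because the byte order is part of the format rather than of the machine, the same reader code produces the same values on BE and LE hosts.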

    JPEG, PNG, and other formats either avoid multi-byte numeric values or define how the bytes in the file are to be interpreted when such values are used.

    Certain data formats use a byte order mark (BOM), part of a flexible format that lets the writer choose the endianness (for example, the one that is natural for the writing system) and lets the reader determine the endianness of the file.
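A minimal sketch of BOM-based detection, limited to UTF-16 for simplicity: the helper `detect_utf16_byte_order` is a name made up for this example, while `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are constants from Python's standard library.

```python
import codecs


def detect_utf16_byte_order(raw: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on a leading UTF-16 BOM."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    return "unknown"


# The writer picks a byte order and records that choice in the BOM.
le_text = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be_text = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(detect_utf16_byte_order(le_text))  # little
print(detect_utf16_byte_order(be_text))  # big

# The generic "utf-16" codec reads the BOM and decodes either variant.
print(le_text.decode("utf-16"), be_text.decode("utf-16"))  # hi hi
```

The two files differ byte for byte, yet both decode to the same text because the reader consults the BOM instead of assuming the host's byte order.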

    For multi-byte text, Unicode encodings such as UTF-16 and UTF-32 use some of the above features (a defined byte order or a BOM), but the more modern encoding, UTF-8, is interpreted as a simple sequence of bytes (rather than multi-byte numbers), so it needs neither a BOM nor a notion of endianness.
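To make the contrast concrete, this short sketch encodes one non-ASCII character (U+00E9, chosen arbitrarily) three ways: UTF-8 has a single byte sequence, while UTF-16 has two depending on the byte order.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 is defined as a byte sequence, so there is only one possible result.
print(utf8.hex())      # c3a9

# UTF-16 stores a 16-bit code unit, so the two byte orders differ.
print(utf16_le.hex())  # e900
print(utf16_be.hex())  # 00e9
```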