How can I write a common datagram parser for both big and little endian systems? What I don't understand is how to read values from a byte buffer 16 bits or 32 bits at a time...
Suppose you have this datagram payload
uint8_t datagram[8] = {0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8};
And some protocol says the payload contains an 8-bit param_a at offset 0, a 16-bit param_b at offset 1, an 8-bit param_c at offset 3 and a 32-bit param_d at offset 4. So you want a common parser that would work on big and little endian machines:
uint8_t param_a = datagram[0];
uint16_t param_b = datagram[1]; // how to use ntohs here?
uint8_t param_c = datagram[3];
uint32_t param_d = datagram[4]; // how to use ntohl here?
Is it better to just cast to structs instead?
Is it better to just cast to structs instead?
No. There are other problems than just endianness, like structure padding (where the amount of padding is implementation specific) and alignment. For example, this structure:
struct myStructure {
    uint8_t Param_a;
    uint16_t Param_b;
    uint8_t Param_c;
    uint16_t Param_d;
};
..is likely to become more like:
struct myStructure {
    uint8_t Param_a;
    uint8_t padding1;    // Inserted by compiler
    uint16_t Param_b;
    uint8_t Param_c;
    uint8_t padding2;    // Inserted by compiler
    uint16_t Param_d;
};
..but could also become this (or anything else):
struct myStructure {
    uint8_t Param_a;
    uint8_t padding1[3]; // Inserted by compiler
    uint16_t Param_b;
    uint8_t padding2[2]; // Inserted by compiler
    uint8_t Param_c;
    uint8_t padding3[3]; // Inserted by compiler
    uint16_t Param_d;
};
For network protocols (where the layout of data must match exactly) this will break everything, even if all computers on the network are little-endian. To prevent problems, compilers provide ways to force a structure to be "packed" (without padding) - e.g. struct __attribute__((__packed__)) myStructure { in GCC. However, some CPUs can't handle misaligned reads, so this can break things in a different way (e.g. cause performance problems and cause atomic operations to fail), so you don't want to use "packed" structures while working on the data afterwards.
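As a concrete sketch of that "packed" syntax (assuming GCC/Clang; other compilers use things like #pragma pack), with a compile-time check that no padding was inserted:

#include <stdint.h>

// Packed: no padding, so the layout matches the wire format byte-for-byte, but member
// accesses may need slow byte-wise reads on CPUs that can't do misaligned loads
struct __attribute__((__packed__)) myPackedStructure {
    uint8_t Param_a;
    uint16_t Param_b;
    uint8_t Param_c;
    uint16_t Param_d;
};

_Static_assert(sizeof(struct myPackedStructure) == 6, "packed struct should contain no padding");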
It's also worth mentioning that (in general) nothing from outside your code (e.g. user input, data from files, data from the network) should be assumed valid. It may have been maliciously constructed to exploit "unexpected cases" in your code; it might be the result of a bug in some other code; and it might be the result of a hardware failure. In any case you need to sanity check the data before use (and hopefully report any problems that were found with the data, to make it much easier to do nice user interfaces, to find/fix bugs in other people's code faster, and to avoid "unexplained symptoms" in your own code). To ensure this happens correctly, it's a good idea to use the language's type system - specifically, have one type for "raw and unchecked" data (e.g. an array of uint8_t) and a different type for "sanity checked" data (e.g. a struct myStructure), so that any accident/mistake (e.g. assuming data has been checked when it hasn't) results in a "type mismatch" error at compile time. Of course this means you'd be writing code to convert from one type to the other (while doing sanity checks), which also solves the problems involving data layout (e.g. compiler specific padding, endianness).
For example:
struct myStructure {
    uint8_t Param_a;  // Must be a value from 0 to 100
    uint16_t Param_b; // Must be a value >= "year 2000"
    uint8_t Param_c;  // Flags. Must be 1, 2, 4 or 6.
    uint16_t Param_d; // Sender's "Request ID" (can be anything - always returned as is in the reply packet so the sender can figure out which reply is for which request)
};
int parseRawData(struct myStructure *outData, uint8_t **inputBuffer, size_t *inputBufferSize) {
    uint8_t a;
    uint16_t b;
    uint8_t c;
    uint16_t d;

    // Check size of data received
    if(*inputBufferSize == 0) {
        return 1; // No data
    }
    if(*inputBufferSize < 6) {
        return 2; // Not enough data (yet) - can happen for "split packets" in TCP streams
    }

    // Parse raw data and do sanity checks
    a = (*inputBuffer)[0];
    if(a > 100) {
        return 10; // Value out of range for param_a
    }
    b = ((uint16_t)(*inputBuffer)[1] << 8) | (*inputBuffer)[2]; // 2 bytes, big-endian ("network order"); swap the shift if the protocol is little-endian
    if(b < 2000) {
        return 20; // Value out of range for param_b
    }
    c = (*inputBuffer)[3];
    switch(c) {
        case 1:
        case 2:
        case 4:
        case 6:
            break;
        default:
            return 30; // Bad value or unsupported value for param_c
    }
    d = ((uint16_t)(*inputBuffer)[4] << 8) | (*inputBuffer)[5]; // 2 bytes, big-endian ("network order")

    // Data was valid, so store it and update the buffer tracking
    outData->Param_a = a;
    outData->Param_b = b;
    outData->Param_c = c;
    outData->Param_d = d;
    *inputBuffer += 6;      // 6 bytes consumed
    *inputBufferSize -= 6;
    return 0; // No problem!
}
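A hypothetical caller (buffer handling and names here are just illustrative) might look like:

void handleReceivedBytes(uint8_t *buffer, size_t bytesReceived) {
    uint8_t *cursor = buffer;
    size_t remaining = bytesReceived;
    struct myStructure message;

    // Keep parsing datagrams until the buffer runs out or an error is found
    for(;;) {
        int status = parseRawData(&message, &cursor, &remaining);
        if(status == 1 || status == 2) {
            break; // No data, or not enough data yet - wait for more bytes
        }
        if(status != 0) {
            // Report/log the specific error code, then discard or resynchronise
            break;
        }
        // ...use the sanity checked "message" here...
    }
}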
Of course you'd probably also want to use an enum for error codes, and might want some kind of "convert 2 bytes in buffer into uint16_t" macro.
About ntohl, htonl, ntohs, htons
Almost all computers are little-endian, so (when designing anything - e.g. network protocols, file formats, etc.) you want to use little-endian to improve performance on almost all computers. For historical reasons "network order" is big-endian, which makes ntohl, htonl, ntohs and htons useless when you want to ensure the data is little-endian.
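If you do design a little-endian protocol, the usual portable approach (sketched below) is to assemble values from individual bytes with shifts; this gives the right answer regardless of whether the host is big-endian or little-endian:

#include <stdint.h>

// Read a 16-bit little-endian value from a byte buffer (host endianness doesn't matter)
static inline uint16_t read_le16(const uint8_t *p) {
    return (uint16_t)p[0] | ((uint16_t)p[1] << 8);
}

// Read a 32-bit little-endian value from a byte buffer
static inline uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

// Write a 16-bit value to a byte buffer as little-endian
static inline void write_le16(uint8_t *p, uint16_t value) {
    p[0] = (uint8_t)(value & 0xFF);
    p[1] = (uint8_t)(value >> 8);
}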