
Big endian and little endian: a little confusion


I was reading about little and big endian representations from this site http://www.geeksforgeeks.org/little-and-big-endian-mystery/.

Suppose we have the number 0x01234567. In little endian it is stored as (67)(45)(23)(01), and in big endian it is stored as (01)(23)(45)(67).

    char *s = "ABCDEF";
    int *p = (int *)s;
    printf("%d", *(p+1)); // prints 17475 (value of DC)

Judging by the value printed by the code above, it seems that the string is stored as (BA)(DC)(FE).

Why is it not stored like (EF)(CD)(AB), from LSB to MSB, as in the first example? I thought that endianness means the ordering of bytes within a multi-byte value. So the ordering should be with respect to the "whole 2 bytes", as in the second case, and not within those 2 bytes, right?


Solution

  • Working with 2-byte ints, this is what you have in memory:

    memAddr  |  0  |  1  |  2  |  3  |  4  |  5  |  6   |
    data     | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | '\0' |
                ^ s points here
                            ^ p+1 points here
    

    It looks like you're using ASCII encoding, so this is what you really have in memory:

    memAddr  |  0   |  1   |  2   |  3   |  4   |  5   |  6   |
    data     | 0x41 | 0x42 | 0x43 | 0x44 | 0x45 | 0x46 | 0x00 |
                ^ s points here
                              ^ p+1 points here
    

    On a little endian machine, the least significant byte of a multi-byte type comes first. There is no concept of endianness for a single-byte char, and an ASCII string is just a sequence of chars, so the string itself has no endianness. Your ints are 2 bytes. For an int starting at memory location 2, the byte at address 2 is the least significant and the byte at address 3 is the most significant. The number here, read the way people generally read numbers, is therefore 0x4443 (17475 in base 10, "DC" as an ASCII string), since 0x44 at memory location 3 is more significant than 0x43 at memory location 2. On a big endian machine this would be reversed, and the number would be 0x4344 (17220 in base 10, "CD" as an ASCII string).
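The arithmetic above can be checked without depending on the byte order of the machine you happen to be on. This is only a minimal sketch, not the asker's code: `read_le16` and `read_be16` are hypothetical helper names, and ASCII is assumed. It combines the two bytes explicitly instead of casting to `int *`, since that cast is not guaranteed to be well defined in C (alignment and aliasing).

```c
#include <stdint.h>

/* Combine two consecutive bytes as a little endian 16-bit value:
   the byte at the lower address is the least significant. */
uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)((unsigned)p[0] | ((unsigned)p[1] << 8));
}

/* Combine the same two bytes as a big endian 16-bit value:
   the byte at the lower address is the most significant. */
uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}
```

With `const unsigned char *s = (const unsigned char *)"ABCDEF";`, calling `read_le16(s + 2)` gives 0x4443 (17475) and `read_be16(s + 2)` gives 0x4344 (17220), whatever the host's own endianness is.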

    EDIT:

    Addressing your comment: a C string is a NUL-terminated array of chars, that's absolutely correct. Endianness only applies to multi-byte primitive types: short, int, long, long long, etc. ("Primitive types" is informal nomenclature; the C standard calls the integer and floating types arithmetic types.) An array is simply a section of contiguous memory in which one or more elements of the same type are stored sequentially, directly next to each other. There is no concept of endianness for the array as a whole; however, endianness does apply to the primitive types of the individual elements. Let's say you have the following, assuming 2-byte ints:

    unsigned int array[3];  // with 2 byte ints, this occupies 6 contiguous bytes in memory (unsigned so that 0x9abc fits)
    array[0] = 0x1234;
    array[1] = 0x5678;
    array[2] = 0x9abc;
    

    This is what memory looks like, whether the machine is big or little endian:

    memAddr   |    0-1   |    2-3   |    4-5   |
    data      | array[0] | array[1] | array[2] |
    

    Notice there is no concept of endianness for the order of the array elements. This is true no matter what the elements are. The elements could be primitive types, structs, anything. The first element in the array is always at array[0].

    But now, if we look at what's actually in each element, this is where endianness does come into play. For a little endian machine, memory will look like this:

    memAddr   |  0   |  1   |  2   |  3   |  4   |  5   |
    data      | 0x34 | 0x12 | 0x78 | 0x56 | 0xbc | 0x9a |
                 ^______^      ^______^      ^______^
                 array[0]      array[1]      array[2]
    

    The least significant byte of each element comes first. A big endian machine would look like this:

    memAddr   |  0   |  1   |  2   |  3   |  4   |  5   |
    data      | 0x12 | 0x34 | 0x56 | 0x78 | 0x9a | 0xbc |
                 ^______^      ^______^      ^______^
                 array[0]      array[1]      array[2]
    

    Notice that the contents of each element of the array are subject to endianness, because it's an array of primitive types. If it were an array of structs, the struct members themselves would not be reordered by some kind of endianness reversal; endianness only applies to the bytes within each primitive member. However, whether on the big or little endian machine, the array elements are still in the same order.
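The two memory layouts above can be reproduced on any machine by writing the bytes out explicitly. This is a rough sketch under stated assumptions: the function names `store_le16_array` and `store_be16_array` are made up for illustration, and `uint16_t` stands in for the 2-byte int assumed in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Write n 16-bit elements into out (2 * n bytes),
   least significant byte of each element first. */
void store_le16_array(const uint16_t *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(in[i] & 0xff);  /* low byte  */
        out[2 * i + 1] = (unsigned char)(in[i] >> 8);    /* high byte */
    }
}

/* Same as above, but most significant byte of each element first. */
void store_be16_array(const uint16_t *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(in[i] >> 8);    /* high byte */
        out[2 * i + 1] = (unsigned char)(in[i] & 0xff);  /* low byte  */
    }
}
```

In both functions element `i` always lands at bytes `2 * i` and `2 * i + 1`; only the order of the two bytes inside that slot changes. For `{0x1234, 0x5678, 0x9abc}` the first function produces `34 12 78 56 bc 9a` and the second produces `12 34 56 78 9a bc`.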

    Getting back to your string, a string is simply a NUL terminated array of characters. chars are single bytes, so there's only one way to order them. Consider the code:

    char word[] = "hey";
    

    This is what you have in memory:

    memAddr   |    0    |    1    |    2    |    3    |
    data      | word[0] | word[1] | word[2] | word[3] |
                      equals NUL terminator '\0' ^
    

    In this case, each element of the word array is a single byte, and there's only one way to order a single item, so whether on a little or big endian machine, this is what you'll have in memory (assuming ASCII):

    memAddr   |  0   |  1   |  2   |  3   |
    data      | 0x68 | 0x65 | 0x79 | 0x00 |
    

    Endianness only applies to multi-byte primitive types. I highly recommend poking around in a debugger to see this in action. All the popular IDEs have memory view windows, and with gdb you can print out memory as bytes, halfwords (2 bytes), words (4 bytes), giant words (8 bytes), etc. On a little endian machine, if you print out your string as bytes, you'll see the letters in order. Print it out as halfwords and you'll see every 2 letters "reversed"; print it out as words and you'll see every 4 letters "reversed", and so on. On a big endian machine, it would all print out in the same "readable" order.
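If no debugger is at hand, roughly the same experiment can be done in code. The sketch below is an illustration only: `host_is_little_endian` and `native_halfword_at` are hypothetical names, ASCII is assumed, and `memcpy` is used in place of the `(int *)` cast from the question so that alignment and aliasing are not a concern.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if this machine stores the least significant byte
   of a multi-byte integer at the lowest address, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);  /* look at the byte at the lowest address */
    return first == 1;
}

/* Reads the 2 bytes at s + offset as one 16-bit value
   in the machine's own byte order. */
uint16_t native_halfword_at(const char *s, size_t offset)
{
    uint16_t value;
    memcpy(&value, s + offset, sizeof value);
    return value;
}
```

On a little endian host, `native_halfword_at("ABCDEF", 2)` yields 0x4443, matching the 17475 from the question; on a big endian host the same call yields 0x4344. The bytes of the string are identical in both cases; only the interpretation of each 2-byte group differs.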