
Explaining (lack of) endianness as it applies to a string


For this question, I'm going to assume every character is single-byte ASCII. If my understanding is correct, endianness applies to the byte ordering of multi-byte words. Because strings only have one byte per character, there is no endianness.

But this becomes a bit confusing to me, as strings are often stored with a NUL character at the 'end' of the string, and wouldn't that matter with respect to endianness? As an example,

.data
string: .asciz "Save"

Now, in gdb, printing the memory location of each of S, a, v, e:

>>> x/cb &string
0x4000b9:   'S'
>>> x/cb (char *) &string+1
0x4000ba:   'a'
>>> x/cb (char *) &string+2
0x4000bb:   'v'
>>> x/cb (char *) &string+3
0x4000bc:   'e'          # LSB at highest memory address (big endian??)

Isn't the string here essentially 'big endian' because the least significant byte (e) is stored at the highest memory address (string+3)?

What part am I missing about how endianness does or does not apply to strings? I think I may be mistaking char-array indexing for endianness, but an answer that clearly points that out would be great.


Solution

  • The address space in this case is based on bytes: each individual address points at one byte. You cannot have endianness with a single-byte quantity; it takes multiple bytes.

    If you have

    0x1000 'S'  (0x53)
    0x1001 'a'  (0x61)
    0x1002 'v'  (0x76)
    0x1003 'e'  (0x65)
    

    There is no endianness there. A string is individual bytes that represent characters in memory linearly in sequential addresses.
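
A minimal Python sketch of that layout, using a bytes object as a stand-in for memory. The base address 0x1000 is the illustrative one from the listing above, and byte_at is a hypothetical helper, not anything gdb or the assembler provides. The trailing NUL is what .asciz appends.

```python
# "Save" plus the terminating NUL that .asciz would add, as raw bytes.
memory = b"Save\x00"
BASE = 0x1000  # illustrative base address, not a real one


def byte_at(address):
    """Return the single byte stored at an address (one address, one byte)."""
    return memory[address - BASE]


# Walk the addresses in ascending order; each access is exactly one byte,
# so there is nothing for byte order to reorder.
for offset in range(len(memory)):
    address = BASE + offset
    value = byte_at(address)
    print(f"0x{address:04x}  0x{value:02x}  {chr(value)!r}")
```

Index 3 being the 'e' is just array indexing: the fourth element sits at the fourth address. No byte is more or less "significant" than another until the bytes are combined into one larger number.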

    If you were to examine those BYTES no longer as characters but through, say, a 32 bit WORD view, then

    0x1000: 0x53617665 is a typical big endian view
    0x1000: 0x65766153 is a typical little endian view
    

    That is what a 32 bit read of the same data at address 0x1000 gives you. At that point it is not a string; it is bytes being viewed 32 bits at a time at some address. Endianness is an AND thing: it only comes up if you are trying to view/use the data as bytes AND as a larger quantity, two views of the same data. An ASCII string is not something we view like that.
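
The two word views can be reproduced with a short Python sketch. It uses int.from_bytes to play the role of the 32 bit read, so the result does not depend on the machine running it; the variable names are only for illustration.

```python
data = b"Save"  # bytes 0x53 0x61 0x76 0x65 at ascending addresses

# Same four bytes, two interpretations as one 32 bit quantity.
as_big = int.from_bytes(data, byteorder="big")        # low address = most significant
as_little = int.from_bytes(data, byteorder="little")  # low address = least significant

print(hex(as_big))     # 0x53617665
print(hex(as_little))  # 0x65766153

# The byte view is unchanged by either interpretation.
print(data)            # b'Save'
```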

    Note that strings, integers, floats, booleans, addresses, all the data types, are irrelevant to the processor. Bits is bits; they only mean something, to the processor as well as to the user, when they are used. Otherwise they are just bits with no meaning. You can, for example, "copy" a(n ASCII) "string" by doing word reads and writes, as a memcpy() might, and yes, to you it is a string, but it is just bytes being copied. Big or little endian does not matter: all of the bytes are picked up and put down in groups, and the result will still look like a string when viewed as a linear sequence of bytes by that processor and its addressing.
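
A rough sketch of that word-at-a-time copy, assuming the struct module as a model of 32 bit loads and stores. copy_by_words is a hypothetical helper and is not how any particular memcpy() is implemented; it only shows that a copy which loads and stores with the same byte order leaves the byte sequence intact.

```python
import struct


def copy_by_words(src, byteorder):
    """Copy src four bytes at a time via 32 bit loads and stores.

    byteorder is "<" (little endian) or ">" (big endian) in struct notation.
    Assumes len(src) is a multiple of 4.
    """
    dst = bytearray(len(src))
    for offset in range(0, len(src), 4):
        (word,) = struct.unpack_from(byteorder + "I", src, offset)  # 32 bit load
        struct.pack_into(byteorder + "I", dst, offset, word)        # 32 bit store
    return bytes(dst)


text = b"Save the data!\x00\x00"  # 16 bytes, NUL padded to a multiple of 4

little = copy_by_words(text, "<")
big = copy_by_words(text, ">")

# The intermediate word values differ between the two runs, but the
# copied bytes are identical to the source either way.
print(little == text, big == text)
```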

    There are exceptions to these general statements, based on processors that have different endian modes and various other atypical situations that I have certainly experienced, but they don't need to confuse things here. The general understanding is that the low address byte is either the most significant (big endian) or the least significant (little endian) byte in an access that is sized in multiple bytes (16 bit, 32 bit, 64 bit, etc). This assumes a byte is 8 bits on your system; 9 bit and other byte sizes would not change this, they would just change the size of the accesses.

    The biggest problem with endianness is that folks try to overcomplicate it. "OMG this is an X endian processor, I am used to a Y endian processor, it is going to make my life difficult, I am going to have to play games with addressing and do all this extra work." Nope. In general you just created a problem that was not there, and now you have bugs you have to fix.

    The right answer is to understand the system first and not think of that e-word. Then, when you see the busses, or the peripherals and their interfaces, or the data objects you need to move around from networks or filesystems, etc, compare them to the e-word of your computer and decide from a system engineering perspective: does this already fit the e-word of this system if I do this access to this thing, or do I need to shift, byte swap, or otherwise convert the data so that it is oriented right when I perform operation X on it? If you do not have to perform an actual operation (addition of some numbers, etc), do you even care? If you are simply transferring data from point A to point B and the system engineering shows that no data manipulation is required (reading a file from a hard drive and transmitting it over the network), then you do not need to think about or talk about the e-word.
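
To make the "do you even care?" test concrete, here is a small Python sketch. The packet layout (a 32 bit big endian length field followed by the payload) is a made-up format for illustration only, and forward, payload_length and payload are hypothetical helpers.

```python
import struct

# Hypothetical packet: 32 bit big endian length field, then the payload.
packet = struct.pack(">I", 4) + b"Save"


def forward(raw):
    """Move data from A to B without interpreting it: no byte order decision."""
    return bytes(raw)


def payload_length(raw):
    """Do arithmetic with the field, so its defined byte order must be honored."""
    (length,) = struct.unpack_from(">I", raw, 0)
    return length


def payload(raw):
    """Slice out the payload using the decoded length."""
    return raw[4:4 + payload_length(raw)]


print(forward(packet) == packet)  # plain transfer, bytes untouched
print(payload_length(packet))     # the number is only needed here
print(payload(packet))
```

Only payload_length ever treats the first four bytes as a number, so it is the only place where the byte order of the format has to be spelled out. Reading the same field with the wrong order would give 0x04000000 instead of 4.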