
Bits, bytes, character, and integer sizes in C


From what I've googled and found on Stack Overflow, a character is always 1 byte. This is for sure. But I've also seen that a byte may not be 8 bits (it could be more, see CHAR_BIT). Does this mean that if it is, let's say, 10 bits on some hypothetical architecture, then 1 byte = 10 bits? Does this imply that if the implementation defaults integers to 4 bytes, then the size of an int will be 40 bits? I'm a little wonky with all this bit-byte variation, despite the insane number of threads on the internet.


Solution

  • On modern platforms these sizes are predictable: a byte is eight bits, an int is typically four bytes, and so on. The definition of a "byte" is unlikely to change, as this version of it coalesced around the introduction of 8-bit CPUs in the 1970s, which arguably led to the triumph of ASCII over competing standards. It has become so ingrained that some languages treat it literally: in French the word for "byte" is "octet".

    For historical platforms all bets are off. Some used 10-, 12-, 18- or 36-bit integers. Early computers didn't use bytes at all but words, which could be of arbitrary size. Larger values were composed of these units, so a "double word" value might be 48 bits for whatever reason. This is from an era when a single bit might be represented by several vacuum tubes or full-sized transistors, so practical cost concerns led to some very unusual designs.

    Now when it comes to characters, they are not necessarily one byte. A C char is one byte by definition, but a character of text need not fit in a single char. In Unicode a character can take a lot more than that: with 8-bit bytes it is one to four bytes in UTF-8, two or four in UTF-16, and four in UTF-32, the common encodings for this sort of text. It is only in 8-bit encodings like Latin-1 that characters and bytes are interchangeable.
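
    The gap between bytes and text characters can be seen with a short sketch. It assumes 8-bit bytes and valid UTF-8 input; utf8_code_points is a hypothetical helper written for this illustration, not a standard library function.

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Count Unicode code points in a UTF-8 string by skipping continuation
       bytes, which always have the bit pattern 10xxxxxx.
       Assumes 8-bit bytes and well-formed UTF-8 (no validation is done). */
    size_t utf8_code_points(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; ++s) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++count;
        }
        return count;
    }

    int main(void)
    {
        /* "café" with the final letter written as explicit UTF-8 bytes,
           so the result does not depend on the source file's encoding. */
        const char *text = "caf\xC3\xA9";

        printf("bytes: %zu, code points: %zu\n",
               strlen(text), utf8_code_points(text));
        return 0;
    }
    ```

    Here strlen reports 5 while the helper reports 4, because the accented letter occupies two bytes in UTF-8.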

    It's important to consider that while most consumer CPUs are pretty homogeneous in their sizes, this all goes out the window when dealing with specialized devices like DSPs or custom FPGA processors.

    The good news is that you're unlikely to do character processing on a DSP or custom FPGA processor that's built around unusual register sizes. Those are usually focused on processing some other kind of digital data.