Search code examples
encodingcharacter-encodingasciilatin

Character set vs codepage layout


1) Can anyone explain me why the ASCII and Latin-1 table is once in the chapter Character Set and once under Code page layout? I am fine if both terms are interchangbly used, but this is still inconsistent, or am I missing something?

2) Are ASCII and Latin-1 fully compatible? 0x00 to 0x1F don't seem to be defined in Latin-1, why?

enter image description here

enter image description here


Solution

  • A character set is a set of notional writing system concepts, such as capital Fraktur Z, line feed, or bicycle symbol. These include typographic style variations that have significant contexts for usage (e.g. mathematics) but not typical typeface (font) variations.

    Each codepoint in a character set is an element in a mapping between the "character" and an integer.

    A character encoding is an algorithm to convert between a codepoint in the character set and a sequence of one or more code units in the character encoding. Code units are integers. Integers wider than one byte have a byte order (endianness). A code unit is serialized to a sequence of bytes for streaming or storage. Character encoding functions often map both steps at once: between a codepoint and bytes.

    Many character sets have one character encoding. Many character encodings have single-byte code units. This makes them easy to present with the concepts of codepoint, code unit and byte collapses as well as character set and character encoding collapsed.

    This all has a long history. Terminology, focus and standards have evolved. The context can be a clue as to what is meant. "Code page" is/was often used when identifying a particular extension to ASCII. In some original standards, only the differences or extensions were documented. Vendor libraries often filled in gaps in the character sets so they would be completely defined over 256 codepoints. When the Unicode character set was being developed, transcoding tables between Unicode and other character set were accepted from vendors. This effectively standardized some character set to 256 codepoints. (You can see the Unicode codepoint in hexadecimal in your tables.)

    ASCII and Latin-1 (effectively the same as ISO 8859-1) are compatible in a limited sense: The first 128 codepoints and code unit values are the same. ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Nobody likes a mess like that. That's why the members of Unicode just took the characters sets as they were used in the field when creating mappings between Unicode and other character sets.