Tags: php, string, encoding, character-encoding, stateful

What is meant by 'state-dependent encodings', 'same byte values', and 'initial and non-initial shift states'?


I'm using the 64-bit Windows 10 Home Single Language edition on my machine.

I've installed the latest version of XAMPP, which installed PHP 7.2.6 on my machine.

I came across the following sentence in a paragraph of the PHP Manual.

I understood most of that paragraph, but not its last sentence, which I've quoted below:

Note, however, that state-dependent encodings where the same byte values can be used in initial and non-initial shift states may be problematic.

I have the following questions about that sentence, in the context of the paragraph titled "Details of the String Type":

  1. What does 'state-dependent encodings' mean in this context?
  2. What does 'initial and non-initial shift states' mean in this context?
  3. What does 'same byte values' mean, given that they can be used in the 'initial and non-initial shift states' mentioned above?
  4. How can the same byte values be used in 'initial and non-initial shift states', and why can that be problematic?

Solution

  • Some encodings have byte marks (in ISO 2022, escape sequences) that select how the following bytes are interpreted, until the next mark.

    For example, after a "Japanese" mark the following bytes are interpreted as Japanese characters (at, say, 2 bytes per character), and after a "Latin" mark they are interpreted as Latin-1.

    So to decode a string, the decoder has to keep track of the state, i.e. which interpretation is currently in effect.

    In the example above, the same byte could be read as part of a Japanese character or as a Latin-1 character, depending on the state. A string starts in a default (initial) state, but if you take a substring you may cut off the mark, so the substring may be decoded with the wrong interpretation.

    So you should copy the current state (the mark) and prefix it to every substring.

    ISO 2022 defines a way to implement such encodings, and the Wikipedia article lists various implementations: https://en.wikipedia.org/wiki/ISO/IEC_2022.

    Such encodings are now largely obsolete. Unicode has superseded them, especially in the places where the ISO 2022 encodings were common and where encodings were a huge problem. Note: UTF-8 is also state dependent (for the bytes within a character/code point), but UTF-8 was designed so that the state is reset at every character, and the first byte of a character falls in a predefined range. Unicode also keeps some state, such as the direction of text (right-to-left and left-to-right), but using it is discouraged: direction should preferably be set at a higher level (e.g. in HTML) rather than with the discouraged Unicode direction control codes.
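The points above can be made concrete with a small sketch. It uses Python purely for illustration, because its standard library ships an "iso2022_jp" codec; the assumption is that the same thing happens to the raw bytes of a PHP string when it is cut with a byte-oriented function. ISO-2022-JP is just one example of a state-dependent encoding, and the helper name resync is made up for this sketch.

```python
# Illustration only: Python bytes stand in for the raw bytes of a PHP string.

ESC_JIS = b"\x1b$B"    # ISO-2022-JP escape: shift to JIS X 0208 (2 bytes per character)
ESC_ASCII = b"\x1b(B"  # ISO-2022-JP escape: shift back to ASCII (the initial state)

encoded = "日本語".encode("iso2022_jp")
print(encoded)  # escape, two bytes per character, then the escape back to ASCII

# Naive byte slice: skip the 3-byte escape and take the two bytes of the first character.
fragment = encoded[len(ESC_JIS):len(ESC_JIS) + 2]

# Without the escape, the decoder starts in the initial (ASCII) state,
# so the very same two bytes are read as two ASCII characters.
wrong = fragment.decode("iso2022_jp")
print(wrong)  # F|

# Prefixing the escape (and shifting back at the end) restores the correct state.
right = (ESC_JIS + fragment + ESC_ASCII).decode("iso2022_jp")
print(right)  # 日

# UTF-8 for comparison: lead bytes and continuation bytes use disjoint ranges,
# so a decoder can find the next character boundary from any offset.
utf8 = "日本語".encode("utf-8")  # 3 bytes per character here


def resync(data: bytes, offset: int) -> int:
    """Advance offset to the next UTF-8 character boundary (hypothetical helper)."""
    # Continuation bytes always look like 10xxxxxx; lead bytes never do.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


start = resync(utf8, 1)               # offset 1 lands in the middle of the first character
tail = utf8[start:].decode("utf-8")
print(start, tail)  # 3 本語
```

In the ISO-2022-JP half, nothing in the two sliced bytes says which state they belong to, which is the "same byte values in initial and non-initial shift states" problem from the manual. In the UTF-8 half, the bytes themselves carry enough information to recover.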