Tags: unicode, encoding, character-encoding, unicode-normalization

Is this a case of weird UTF-8 encoding conversion?


I am working with a remote application that seems to do some magic with the encoding. The application renders clear responses (which I'll refer to as True and False) depending on user input. I know two valid values that render 'True'; all others should render 'False'.

What I found (accidentally) interesting is that submitting a corrupted value also leads to 'True'.

Example input:

USER10 //gives True
USER11 //gives True
USER12 //gives False
USER.. //gives False
OTHERTHING //gives False

So basically only the first two values render a True response.

What I noticed is that, surprisingly, USER˱0 (hex-wise \x55\x53\x45\x52\xC0\xB1\x30) is accepted as True. I checked other hex bytes, with no such success. This leads me to the conclusion that \xC0\xB1 is somehow translated into 0x31 (= '1').
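
For reference, a strict UTF-8 decoder rejects this byte sequence outright; a quick Python check (just reproducing the bytes I submitted) shows the error:

# A conforming UTF-8 decoder refuses the C0 lead byte, so whatever the
# remote side does, it is not strict UTF-8 decoding.
payload = b"\x55\x53\x45\x52\xC0\xB1\x30"   # "USER" + C0 B1 + "0"
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)   # ... can't decode byte 0xc0 in position 4: invalid start byte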

My question is: how could this happen? Is the application performing some weird conversion from UTF-16 (or something else) to UTF-8?

I'd appreciate any comments/ideas/hints.


Solution

  • C0 is an invalid lead byte in well-formed UTF-8 (it can only start an overlong encoding), but if a lax UTF-8 decoder accepts it, C0 B1 would be decoded as ASCII 31h (the character '1').

    Quoting Wikipedia:

    ...(C0 and C1) could only be used for an invalid "overlong encoding" of ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using two bytes instead of one)...
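
A minimal sketch of the arithmetic (Python; the helper name is just illustrative) shows what a lenient decoder that skips the overlong/invalid-lead check would produce for C0 B1:

def lenient_two_byte(lead, cont):
    # Decode a 110xxxxx 10xxxxxx pair without rejecting overlong forms.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)

cp = lenient_two_byte(0xC0, 0xB1)
print(hex(cp), chr(cp))   # 0x31 '1' -- the overlong two-byte encoding of ASCII '1'

On such a decoder the submitted bytes 55 53 45 52 C0 B1 30 come out as USER10, which is one of the two accepted values.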