
Using bit shifting to guess UTF-8 encoding


I am writing a program like file(1) that guesses whether a text file contains ASCII characters, ISO-8859-1 characters, or UTF-8. I've already programmed it to guess ASCII and ISO-8859-1; only UTF-8 remains. My problem is that I am supposed to use bit shifting, and while I know the very basics of bit shifting, I am having trouble figuring out how to use it for guessing UTF-8 characters. I am of course not asking for a full solution, but if someone could push me in the right direction, I would be pleased!

I am writing in C.


Solution

  • Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):

    0x00000000 - 0x0000007F:
        0xxxxxxx
    0x00000080 - 0x000007FF:
        110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF:
        1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF:
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    

    So your heuristic can look a few bytes ahead and see if the bytes follow one of four patterns (the original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 limits them to four):

    1. 0* (you'll have to be careful to distinguish this from regular ASCII files)
    2. 110*, 10*
    3. 1110*, 10*, 10*
    4. 11110*, 10*, 10*, 10*

    Checking for these is easy:

    To check whether an unsigned char a fits one of these patterns, use:

    1. For 10* - the most frequent pattern - use (a >> 6) == 0x2.
    2. For 0* - use (a >> 7) == 0x0.
    3. For 110* - use (a >> 5) == 0x6.
    4. For 1110* - use (a >> 4) == 0xe.
    5. For 11110* - use (a >> 3) == 0x1e.

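    All we're doing is shifting the bits to the right until only the leading marker bits remain, then checking whether they equal the prefix bits in the UTF-8 byte sequences. For example, 0xC3 is 11000011 in binary, so 0xC3 >> 5 leaves 110, which is 0x6.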
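
    Put together, the checks above could be sketched roughly as follows. This is only one possible arrangement, not a definitive implementation: the names utf8_seq_len, utf8_is_continuation, looks_like_utf8 and the three result constants are made up for illustration, and the sketch assumes the whole input is already in one buffer. It only verifies the lead-byte and continuation-byte structure, so it does not reject overlong encodings or other values that a strict UTF-8 validator would refuse.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum { UTF8_INVALID = 0, UTF8_ASCII_ONLY = 1, UTF8_MULTIBYTE = 2 };

/* Length of the sequence that lead byte a announces, or 0 if a
   cannot start a sequence (for example a 10xxxxxx continuation byte). */
static int utf8_seq_len(unsigned char a)
{
    if ((a >> 7) == 0x00) return 1;  /* 0xxxxxxx */
    if ((a >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((a >> 4) == 0x0e) return 3;  /* 1110xxxx */
    if ((a >> 3) == 0x1e) return 4;  /* 11110xxx */
    return 0;
}

/* Nonzero if a has the form 10xxxxxx. */
static int utf8_is_continuation(unsigned char a)
{
    return (a >> 6) == 0x02;
}

/* Walk the buffer one sequence at a time. Returns UTF8_INVALID as soon
   as a byte breaks the patterns, UTF8_ASCII_ONLY if every byte was
   0xxxxxxx, and UTF8_MULTIBYTE if at least one well-formed multi-byte
   sequence was seen. */
static int looks_like_utf8(const unsigned char *buf, size_t n)
{
    int saw_multibyte = 0;
    size_t i = 0;

    while (i < n) {
        int len = utf8_seq_len(buf[i]);

        /* Bad lead byte, or the sequence would run past the buffer. */
        if (len == 0 || (size_t)len > n - i)
            return UTF8_INVALID;

        /* Every byte after the lead byte must be 10xxxxxx. */
        for (int k = 1; k < len; k++)
            if (!utf8_is_continuation(buf[i + k]))
                return UTF8_INVALID;

        if (len > 1)
            saw_multibyte = 1;
        i += (size_t)len;
    }
    return saw_multibyte ? UTF8_MULTIBYTE : UTF8_ASCII_ONLY;
}
```

    Returning a separate value for the all-ASCII case is one way to handle the caveat on pattern 1: a file that never leaves 0xxxxxxx is plain ASCII, so only the multi-byte result is evidence for UTF-8. A lone ISO-8859-1 byte such as 0xE9 fails here because it announces a three-byte sequence that the following bytes do not complete. If the file is read in chunks, a sequence cut off at the end of a chunk would be reported as invalid by this sketch, so the caller would need to carry the trailing bytes over to the next chunk.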