Using bit shifting to guess UTF-8 encoding

I am writing a program like file(1) that guesses whether a text file contains ASCII characters, ISO-8859-1 characters, or UTF-8. I've already programmed it to guess ASCII and ISO-8859-1; only UTF-8 remains. My problem is that I am supposed to use bit shifting, and while I know the very basics of bit shifting, I am having trouble figuring out how to use it to guess UTF-8 characters. I am of course not asking for a solution, but if someone could push me in the right direction, I would be pleased!

I am writing in C.

Solution

Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):

0x00000000 - 0x0000007F:
    0xxxxxxx
0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So your heuristic can look a few bytes ahead and see whether the bytes follow one of four patterns (the original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 restricts it to four):

0* (you'll have to be careful to distinguish this from regular ASCII files)
110*, 10*
1110*, 10*, 10*
11110*, 10*, 10*, 10*
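
As a rough illustration of the look-ahead idea, here is a minimal sketch that walks a buffer and checks each sequence against the four patterns, using right shifts to read the leading bits. The function names utf8_seq_len and looks_like_utf8 are made up for this example, and the decision to answer "UTF-8" only when at least one multi-byte sequence was seen is just one possible way to keep pure ASCII files separate. It is a heuristic, not a full validator: it does not reject overlong encodings or code points above U+10FFFF, and it assumes the whole text is in the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: length of the sequence that starts with lead,
 * or 0 if lead cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead >> 7) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((lead >> 4) == 0x0e) return 3;  /* 1110xxxx */
    if ((lead >> 3) == 0x1e) return 4;  /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}

/* Hypothetical heuristic: returns 1 if every byte fits one of the four
 * patterns and at least one multi-byte sequence was seen, 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;
    int saw_multibyte = 0;

    while (i < n) {
        size_t len = utf8_seq_len(buf[i]);

        if (len == 0 || len > n - i)
            return 0;                   /* bad lead byte or truncated sequence */

        for (size_t k = 1; k < len; k++) {
            if ((buf[i + k] >> 6) != 0x02)
                return 0;               /* continuation must be 10xxxxxx */
        }

        if (len > 1)
            saw_multibyte = 1;
        i += len;
    }
    return saw_multibyte;
}
```

With this sketch, the bytes of "café" encoded as UTF-8 (0x63 0x61 0x66 0xC3 0xA9) are accepted, while the ISO-8859-1 encoding (0x63 0x61 0x66 0xE9) is rejected because 0xE9 looks like the start of a three-byte sequence that never arrives.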

Checking for these is easy:

To check if an unsigned char a fits one of these patterns, run:

For 10* - the most frequent pattern - use (a >> 6) == 0x2.
For 0* - use (a >> 7) == 0x0.
For 110* - use (a >> 5) == 0x6.
For 1110* - use (a >> 4) == 0xe.
For 11110* - use (a >> 3) == 0x1e.

All we're doing is shifting the bits to the right and checking if they're equal to the bits in the UTF-8 byte sequences.
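
A worked example may help. The byte 0xC3 is 11000011 in binary; shifting it right by 5 leaves 110, which is 0x6, so it starts a two-byte sequence. The byte 0xA9 is 10101001; shifting it right by 6 leaves 10, which is 0x2, so it is a continuation byte. Together, 0xC3 0xA9 fits the two-byte pattern. One caveat: these comparisons assume the byte is unsigned. If you read bytes into a plain char, which is signed on many platforms, bytes of 0x80 and above become negative, and right-shifting a negative value is implementation-defined and usually sign-extends. A small wrapper like the one below (top_bits is a made-up name for illustration) avoids that by converting to unsigned char first.

```c
/* Hypothetical helper: convert to unsigned char before shifting, so a
 * byte stored in a plain (possibly signed) char never sign-extends. */
unsigned top_bits(char c, int shift)
{
    return (unsigned)(unsigned char)c >> shift;
}
```

The checks then read the same as above, for example top_bits(c, 6) == 0x2 for a continuation byte.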