I am writing a program like file(1) that can guess whether a text file contains ASCII characters, ISO-8859-1 characters, or UTF-8. I've already programmed it to guess ASCII and ISO-8859-1; only UTF-8 remains. My problem is that I am supposed to be using bit-shifting, and while I know the very basics of bit-shifting, I am having trouble figuring out how to use it for guessing UTF-8 characters. I am of course not asking for a solution, but if someone could push me in the right direction, I would be pleased!
I am writing in C.
Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
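To make the table concrete, here is a minimal sketch of the opposite direction: packing a code point into those bit patterns with shifts and masks. The name utf8_encode is my own, not a standard function, and the sketch follows only the four ranges listed above (it does not reject surrogates or values above 0x10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): writes the UTF-8 encoding
   of code point cp into out and returns the number of bytes written, or 0
   if cp lies beyond the four-byte range in the table above. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;                           /* 0xxxxxxx */
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));           /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 2;
    }
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 3;
    }
    if (cp <= 0x1FFFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));          /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F)); /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 4;
    }
    return 0;
}
```

For example, 0xE9 (é in ISO-8859-1) comes out as the two bytes 0xC3 0xA9, and 0x20AC (the euro sign) as 0xE2 0x82 0xAC. Every byte after the first starts with the bits 10.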
So your heuristic can look a few bytes ahead and see whether the bytes follow one of four patterns (UTF-8 in theory supports byte sequences stretching to six bytes, but in practice only uses four):
0* (you'll have to be careful to distinguish this from regular ASCII files)
110*, 10*
1110*, 10*, 10*
11110*, 10*, 10*, 10*
Checking for these is easy. To check whether an unsigned char a fits one of these patterns, run:
10* - the most frequent pattern - use (a >> 6) == 0x2.
0* - use (a >> 7) == 0x0.
110* - use (a >> 5) == 0x6.
1110* - use (a >> 4) == 0xe.
11110* - use (a >> 3) == 0x1e.
All we're doing is shifting the bits to the right and checking if they're equal to the bits in the UTF-8 byte sequences.
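Putting the shift checks together, a rough sketch of the look-ahead could be as follows. The names utf8_lead_len and looks_like_utf8 are my own, and this is only one way to build the heuristic. It assumes the whole input is in one buffer, treats a sequence cut off at the end of the buffer as a mismatch, and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: returns the sequence length announced by lead byte a
   (1 to 4), or 0 if a cannot start a sequence. */
static size_t utf8_lead_len(unsigned char a)
{
    if ((a >> 7) == 0x0)  return 1;  /* 0xxxxxxx */
    if ((a >> 5) == 0x6)  return 2;  /* 110xxxxx */
    if ((a >> 4) == 0xe)  return 3;  /* 1110xxxx */
    if ((a >> 3) == 0x1e) return 4;  /* 11110xxx */
    return 0;                        /* 10xxxxxx or 11111xxx */
}

/* Hypothetical heuristic: returns 1 if every byte in buf fits one of the
   four patterns and at least one multi-byte sequence was seen (so plain
   ASCII is not reported as UTF-8); returns 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;
    int saw_multibyte = 0;

    while (i < n) {
        size_t len = utf8_lead_len(buf[i]);

        /* Reject a stray continuation byte or a truncated sequence. */
        if (len == 0 || len > n - i)
            return 0;

        /* Every byte after the lead byte must look like 10xxxxxx. */
        for (size_t k = 1; k < len; k++) {
            if ((buf[i + k] >> 6) != 0x2)
                return 0;
        }

        if (len > 1)
            saw_multibyte = 1;
        i += len;
    }
    return saw_multibyte;
}
```

Requiring at least one multi-byte sequence is a design choice to keep the ASCII and UTF-8 answers apart. An ISO-8859-1 file will usually fail because a high byte such as 0xE9 is rarely followed by the continuation bytes it announces. If you read the file in chunks, you would need to carry a partial sequence over to the next chunk instead of rejecting it.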