c# unicode utf-8 utf-16 utf8-decode

How do I accomplish random reads of a UTF-8 file?


My understanding is that reads from a UTF-8 or UTF-16 encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages, for example).

How can I use .NET to skip to an approximate position within the file and read the Unicode text from a semi-random position?

Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?


Solution

  • Easy: UTF-8 is self-synchronizing.
    Simply jump to a random byte in the file and skip over all bytes whose two leading bits are 10 (continuation bytes, of the form 10xxxxxx). The first byte that does not start with 10 is the lead byte of a proper UTF-8 character, and from there you can read the following bytes with a regular UTF-8 decoder.
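The skip-then-decode approach described above could be sketched in C# roughly as follows. This is a minimal illustration, not a definitive implementation: the class name Utf8RandomAccess and its methods are hypothetical names chosen for this example, and the sketch assumes a seekable stream (such as a FileStream) containing well-formed UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for this example; not part of the .NET libraries.
public static class Utf8RandomAccess
{
    // True for UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.
    public static bool IsContinuationByte(int b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Seeks to approximateOffset, then skips forward over continuation bytes
    // until the stream is positioned on the first byte of a character
    // (or at end of stream). Returns the resulting position.
    public static long SeekToCharacterStart(Stream stream, long approximateOffset)
    {
        stream.Seek(approximateOffset, SeekOrigin.Begin);
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (!IsContinuationByte(b))
            {
                // We consumed the lead byte; step back so the decoder sees it.
                stream.Seek(-1, SeekOrigin.Current);
                break;
            }
        }
        return stream.Position;
    }

    // Reads up to maxChars UTF-16 code units of text, starting at the first
    // character boundary at or after approximateOffset.
    public static string ReadTextFrom(Stream stream, long approximateOffset, int maxChars)
    {
        SeekToCharacterStart(stream, approximateOffset);
        // BOM detection is disabled because we are not at the start of the file;
        // leaveOpen keeps the caller's stream usable afterwards.
        using (var reader = new StreamReader(stream, new UTF8Encoding(false), false, 1024, true))
        {
            var buffer = new char[maxChars];
            int read = reader.ReadBlock(buffer, 0, maxChars);
            return new string(buffer, 0, read);
        }
    }
}
```

For example, the string "aé漢b" encodes to the bytes 61, C3 A9, E6 BC A2, 62. Asking for text from byte offset 2 lands on the continuation byte A9, so the helper skips ahead to offset 3 and decoding starts at "漢". Note that StreamReader reads ahead into its own buffer, so the stream position after ReadTextFrom does not indicate where decoding stopped.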