
How to read UTF-8 string given its length in characters in plain C89?


I'm writing a custom, cross-platform, minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)

The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.

But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: The client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it is actual UTF-8 characters, not something else.)

I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from FILE *.)

U<CRLF>       ; data type marker (actually read by dispatching code)
<SIZE><CRLF>  ; UTF-8 string size in characters
<DATA><CRLF>  ; data blob

Example:

U
7
Юникод!

Update:

One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.

And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator, and I don't want to mess with escaping it in the data.


Solution

  • This looks like exactly what I need. I wish I had found it earlier:

    http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
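
For illustration, here is a minimal C89 sketch of the general idea, not the decoder from the link above: the first byte of each UTF-8 sequence tells you how many bytes the character occupies, so you can read exactly SIZE characters from a FILE * without knowing the byte count in advance. The names utf8_seq_len and read_utf8_chars are hypothetical. The sketch assumes SIZE counts Unicode code points. It only checks lead and continuation bytes, so it does not reject every malformed sequence (overlong three-byte forms and encoded surrogates slip through); a full validator such as the DFA decoder linked above covers those. Note also that JavaScript's String.length counts UTF-16 code units, so a character outside the Basic Multilingual Plane (a four-byte UTF-8 sequence) is reported as 2 by the client; if such characters can occur, the loop would have to count a four-byte sequence as two units.

```c
#include <stdio.h>
#include <stdlib.h>

/* Length in bytes of the UTF-8 sequence starting with lead byte c,
   or 0 if c cannot start a sequence. */
static int utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;                 /* 0xxxxxxx: ASCII */
    if (c >= 0xC2 && c <= 0xDF) return 2;   /* 110xxxxx (C0, C1 are overlong) */
    if (c >= 0xE0 && c <= 0xEF) return 3;   /* 1110xxxx */
    if (c >= 0xF0 && c <= 0xF4) return 4;   /* 11110xxx, up to U+10FFFF */
    return 0;                               /* continuation byte or invalid */
}

/* Reads exactly nchars UTF-8 characters from f.
   On success returns a malloc'ed, NUL-terminated buffer and stores its
   length in bytes in *nbytes. Returns NULL on EOF, malformed input,
   or allocation failure. */
static char *read_utf8_chars(FILE *f, size_t nchars, size_t *nbytes)
{
    char *buf;
    size_t used = 0;
    size_t i;

    /* Worst case is 4 bytes per character; guard against overflow.
       Assumption: the caller has already capped nchars to a sane limit. */
    if (nchars > ((size_t)-1 - 1) / 4) return NULL;
    buf = (char *)malloc(nchars * 4 + 1);
    if (buf == NULL) return NULL;

    for (i = 0; i < nchars; i++) {
        int c = getc(f);
        int len, k;

        if (c == EOF) goto fail;
        len = utf8_seq_len((unsigned char)c);
        if (len == 0) goto fail;
        buf[used++] = (char)c;

        /* Every following byte of the sequence must look like 10xxxxxx. */
        for (k = 1; k < len; k++) {
            c = getc(f);
            if (c == EOF || (c & 0xC0) != 0x80) goto fail;
            buf[used++] = (char)c;
        }
    }

    buf[used] = '\0';
    *nbytes = used;
    return buf;

fail:
    free(buf);
    return NULL;
}
```

For the example datagram in the question, calling read_utf8_chars(f, 7, &n) after the size line has been parsed would consume the 13 bytes of "Юникод!" and leave the stream positioned at the trailing CRLF, which the caller still has to read and check. The caller owns the returned buffer and must free() it.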