I'm writing a custom, cross-platform, minimalistic TCP server in plain C89. (I will also accept a POSIX-specific answer.)
The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.
But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: The client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it is actual UTF-8 characters, not something else.)
I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from FILE *.)
U<CRLF> ; data type marker (actually read by dispatching code)
<SIZE><CRLF> ; UTF-8 string size in characters
<DATA><CRLF> ; data blob
Example:
U
7
Юникод!
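To make the framing above concrete, here is a minimal C89 sketch of one possible approach: read the lead byte of each character, derive the sequence length from it, then read that many continuation bytes. The names utf8_seq_len and read_utf8_chars are hypothetical. The sketch assumes that <SIZE> counts Unicode code points and that the caller supplies a large enough buffer (at most 4 bytes per character). It does not reject overlong encodings or encoded surrogates. Note that JavaScript's String.length counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two there and would need different handling.

```c
#include <stdio.h>
#include <stddef.h>

/* Byte length of the UTF-8 sequence starting with `lead` (0..255),
   or 0 if `lead` cannot start a sequence. Hypothetical helper. */
static int utf8_seq_len(int lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* stray continuation or invalid byte */
}

/* Reads exactly `nchars` code points from `in` into `buf` (capacity `cap`
   bytes). Returns the number of bytes stored, or -1 on EOF, malformed
   input, or insufficient capacity. Assumption: `nchars` counts code points. */
static long read_utf8_chars(FILE *in, size_t nchars, char *buf, size_t cap)
{
    size_t used = 0;

    while (nchars-- > 0) {
        int c = getc(in);
        int len, i;

        if (c == EOF) return -1;
        len = utf8_seq_len(c);
        if (len == 0 || used + (size_t)len > cap) return -1;
        buf[used++] = (char)c;

        /* Every following byte of the sequence must look like 10xxxxxx. */
        for (i = 1; i < len; ++i) {
            c = getc(in);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[used++] = (char)c;
        }
    }
    return (long)used;
}
```

After a successful call the stream is positioned right after the last character, so the dispatching code can consume the trailing <CRLF> and continue with the next datagram in the same batch.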
Update:
One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.
And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator: I don't want to mess with escaping it in the data.
This looks like exactly the thing I'd need. Wish I found it earlier: