I have a UTF-8 string given as a null-terminated const char*
. I would like to know if the first letter of this string is an a
by itself. The following code
bool f(const char* s) {
return s[0] == 'a';
}
is wrong, as the first letter (grapheme cluster) of the string might be à
- made from 2 unicode scalar values: a
and `
. So this very simple question seems extremely difficult to answer, unless you know how grapheme clusters are made.
Still, many libraries parse UTF-8 files (YAML files, for instance) and therefore should be able to answer this kind of question. But these libraries don't seem to depend upon a Unicode library.
So my question are:
How to write code that checks if a string starts with the letter a
?
Assuming that there is no simple answer to the first question, how do parsers (such as YAML parsers) manage to parse files without being able to answer this kind of question?
It simply doesn't matter.
Consider: Is this string valid JSON?
"̀"
(That's the byte sequence 22 cc 80 22
.)
You seem to be arguing that it is not: Since a JSON string should start with "
(QUOTATION MARK) but instead this starts with "̀
(QUOTATION MARK + COMBINING GRAVE ACCENT).
The only reasonable response is that you're thinking at the wrong level: Text serialization is defined in terms of code points. Grapheme clusters are only considered for processing natural language and editing text.
And this certainly is considered valid JSON.
>>> json.loads(bytes.fromhex('22cc8022'))
'̀'