I am writing a KMP substring search algorithm for Unicode strings in C using UCharIterators. The problem is that I need to compare the code points the iterators point to, and the comparison should be normalized, but all of the ICU collators take whole strings rather than individual characters.
UCharIterator first_iter, second_iter;
uiter_setUTF8(&first_iter, needle_str, n_needle_bytes);
uiter_setUTF8(&second_iter, needle_str, n_needle_bytes);
...
if (first_iter.current(&first_iter) != second_iter.current(&second_iter)) {
...
The condition above treats 'a' and 'ä' as different, which I don't want it to. I don't like the idea of pre-normalization, as it requires O(n + m) additional memory (to the best of my knowledge, ICU doesn't have a function to do it in place).
I had to switch to the U8_* macros that ICU provides for UTF-8 strings.
I advance the offset with U8_NEXT:
U8_NEXT((uint8_t *)string, string_offset, string_size, status);
And compare like this:
U8_GET((uint8_t *)key, 0, first_key_end, key_size, first_key_c);
U8_GET((uint8_t *)key, 0, second_key_end, key_size, second_key_c);
if (coll->cmp(key + first_key_end, U8_LENGTH(first_key_c),
              key + second_key_end, U8_LENGTH(second_key_c),
              coll)
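That is, the byte length of a single letter is calculated with U8_LENGTH from the decoded code point itself (not from an offset or from part of the string), and the collator is then given just that one-letter slice of the original buffer.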
More on that here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html
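For illustration, here is a minimal, self-contained sketch of how the per-code-point comparison could plug into KMP. It is not the code from my project and does not call ICU at all: next_cp is a simplified stand-in for U8_NEXT that assumes well-formed UTF-8, cp_equal is a hypothetical toy equivalence (it only folds 'ä' to 'a') standing in for the coll->cmp call on one-letter slices, and kmp_find is a name made up for this example. The needle is decoded once into an array of code points, which costs O(m), the same order as the failure table KMP needs anyway; the haystack is streamed with no copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode the code point starting at s[*i] and advance *i past it.
 * Simplified stand-in for ICU's U8_NEXT: assumes well-formed UTF-8. */
static uint32_t next_cp(const uint8_t *s, size_t *i) {
    uint8_t b = s[(*i)++];
    uint32_t c;
    int extra;
    if (b < 0x80) return b;                          /* ASCII */
    else if (b < 0xE0) { extra = 1; c = b & 0x1F; }  /* 2-byte sequence */
    else if (b < 0xF0) { extra = 2; c = b & 0x0F; }  /* 3-byte sequence */
    else               { extra = 3; c = b & 0x07; }  /* 4-byte sequence */
    while (extra-- > 0) c = (c << 6) | (s[(*i)++] & 0x3F);
    return c;
}

/* Hypothetical toy equivalence: treats U+00E4 (ä) as 'a' and nothing else.
 * In real code this is where the collator comparison would go. */
static uint32_t fold(uint32_t c) { return c == 0xE4 ? (uint32_t)'a' : c; }
static int cp_equal(uint32_t a, uint32_t b) { return fold(a) == fold(b); }

/* Returns the code point index of the first match of needle in hay,
 * or -1 if there is no match (or allocation fails). */
long kmp_find(const char *hay, size_t hay_len,
              const char *needle, size_t needle_len) {
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0) return 0;

    /* needle_len bytes is an upper bound on the number of code points. */
    uint32_t *pat = malloc(needle_len * sizeof *pat);
    size_t *fail = malloc(needle_len * sizeof *fail);
    if (pat == NULL || fail == NULL) { free(pat); free(fail); return -1; }

    /* Decode the needle once so it can be indexed by code point. */
    size_t m = 0;
    for (size_t i = 0; i < needle_len; )
        pat[m++] = next_cp((const uint8_t *)needle, &i);

    /* Standard KMP failure table, built with the folded comparison. */
    fail[0] = 0;
    for (size_t q = 1, k = 0; q < m; q++) {
        while (k > 0 && !cp_equal(pat[k], pat[q])) k = fail[k - 1];
        if (cp_equal(pat[k], pat[q])) k++;
        fail[q] = k;
    }

    /* Stream the haystack one code point at a time. */
    long result = -1;
    size_t k = 0, idx = 0;
    for (size_t i = 0; i < hay_len; idx++) {
        uint32_t c = next_cp((const uint8_t *)hay, &i);
        while (k > 0 && !cp_equal(pat[k], c)) k = fail[k - 1];
        if (cp_equal(pat[k], c)) k++;
        if (k == m) { result = (long)(idx + 1 - m); break; }
    }

    free(pat);
    free(fail);
    return result;
}
```

With this sketch, searching for "bar" in "xxbär" (bytes "xxb\xC3\xA4r", length 6) finds the match at code point index 2. Note that the failure table has to be built with the same comparison used during the search, otherwise the shifts can skip valid matches.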