
How to compare ICU iterator values?


I am writing a KMP substring search algorithm for Unicode strings in C using UCharIterators. The problem I am facing is that I need to compare the values the iterators point to, and the comparison should be normalization-aware, while all of the ICU collator functions accept strings rather than individual characters.

UCharIterator first_iter, second_iter;

uiter_setUTF8(&first_iter, needle_str, n_needle_bytes);
uiter_setUTF8(&second_iter, needle_str, n_needle_bytes);

...
if (first_iter.current(&first_iter) != second_iter.current(&second_iter)) {
    ...

The condition above treats 'a' and 'ä' as different, which I don't want. I don't like the idea of pre-normalization, as it requires O(n + m) additional memory (to the best of my knowledge, ICU doesn't have a function to do it in place).


Solution

  • I had to switch to the U8_* macros for UTF-8 strings in ICU. I advance the offset with U8_NEXT:

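    /* Decodes the code point at string_offset into the last argument
       (negative if the sequence is ill-formed) and advances string_offset. */
    U8_NEXT((uint8_t *)string, string_offset, string_size, status);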
    

    The comparison then looks like this:

    U8_GET((uint8_t *)key, 0,  first_key_end, key_size,  first_key_c);
    U8_GET((uint8_t *)key, 0, second_key_end, key_size, second_key_c);
    if (coll->cmp(key +  first_key_end, U8_LENGTH(first_key_c),
                  key + second_key_end, U8_LENGTH(second_key_c),
                  coll)
    

    That is, the byte length of a single character is computed with U8_LENGTH from the decoded code point itself (not from an offset or a part of the string). More on that here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html
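
    To make the offset and length bookkeeping above concrete, here is a minimal sketch that compiles without ICU. The functions `code_point_length`, `next_code_point` and `char_length_at` are hypothetical stand-ins written for this illustration; they only approximate what `U8_LENGTH` and `U8_NEXT` do, and on ill-formed input the decoder here simply skips one byte, which may differ from ICU's behavior. In real code the macros from `unicode/utf8.h` should be used instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for U8_LENGTH: number of bytes needed to encode code point c
   in UTF-8, or 0 if c is a surrogate or outside the Unicode range. */
int code_point_length(int32_t c) {
    if (c < 0) return 0;
    if (c <= 0x7F) return 1;
    if (c <= 0x7FF) return 2;
    if (c <= 0xFFFF) return (c >= 0xD800 && c <= 0xDFFF) ? 0 : 3;
    if (c <= 0x10FFFF) return 4;
    return 0;
}

/* Stand-in for U8_NEXT: decodes the code point starting at s[*i] and
   advances *i past it. Returns -1 on an ill-formed sequence, in which
   case *i has moved forward by exactly one byte (a simplification).
   Precondition: *i < length. */
int32_t next_code_point(const uint8_t *s, size_t *i, size_t length) {
    uint8_t lead = s[(*i)++];
    int32_t c;
    int trail;

    if (lead < 0x80) return lead;                       /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)      { c = lead & 0x1F; trail = 1; }
    else if ((lead & 0xF0) == 0xE0)        { c = lead & 0x0F; trail = 2; }
    else if (lead >= 0xF0 && lead <= 0xF4) { c = lead & 0x07; trail = 3; }
    else return -1;                                     /* stray trail or invalid lead byte */

    size_t start = *i;
    for (int k = 0; k < trail; k++) {
        if (start + k >= length || (s[start + k] & 0xC0) != 0x80) return -1;
        c = (c << 6) | (s[start + k] & 0x3F);
    }
    /* Rejects overlong forms, surrogates and values above U+10FFFF. */
    if (code_point_length(c) != trail + 1) return -1;

    *i = start + trail;
    return c;
}

/* Byte length of the single character that starts at `offset`, or 0 if it
   is ill-formed. The pair (s + offset, char_length_at(...)) is the kind of
   one-character slice that can be handed to a string comparison function. */
size_t char_length_at(const uint8_t *s, size_t offset, size_t length) {
    size_t i = offset;
    int32_t c = next_code_point(s, &i, length);
    return c < 0 ? 0 : (size_t)code_point_length(c);
}
```

    For the three-byte buffer `0x61 0xC3 0xA4` ("aä" in precomposed form), `char_length_at` gives 1 at offset 0 and 2 at offset 1, so each character can be passed to the collator as its own (pointer, length) slice without copying or normalizing the whole string first.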