Tags: c, unicode, trie

How to read a Unicode string as individual ASCII chars and detect that it was actually Unicode, in the fastest way possible?


I am making a library that lets the user insert and search key-value pairs in a trie data structure. When I insert a Unicode string, it breaks down into 4 characters (UTF-8), which is okay, but each character becomes ‘?’. So I tried setlocale(LC_ALL, ""), which didn't work (or maybe I just don't know the right arguments for my case, or where to call it). I don't really care about printing or reading the character as it is. All I want is for it to be represented uniquely somehow.

In my trie there are links like node *next[256].

So all I want is that when a Unicode string gets inserted, it is inserted as a unique combination that makes it possible to search for that string unambiguously. I also want a way to detect that a Unicode character was broken down into 4 individual chars. That's because, for example, if in a string “wxyz” the Unicode character “x” is broken down into a,b,c,d, then the trie would store “wabcdyz”. But if I were actually searching for the plain (non-Unicode) string “wabcdyz”, it would find that entry, which would be a false match.
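
(For reference, the node layout I have in mind is roughly the following; the actual names in my library differ.)

typedef struct node {
    struct node *next[256];  /* one child link per possible char value */
    void        *value;      /* value stored for a key, if this node ends one */
} node;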

Here is a program that shows the Unicode character being broken down into four ‘?’ characters:

#include <stdio.h>

int main(void)
{
    printf("Hello World\n");

    char a[] = "Ƃ";

    /* Iterate over the string byte by byte: each a[i] is one byte of the
     * character's encoding, not one whole character. */
    int i;
    for (i = 0; a[i] != '\0'; ++i)
    {
        printf("%c", a[i]);
    }
    printf("\n");

    return 0;
}
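
A variant of the same program (a sketch, assuming the source file itself is saved as UTF-8) makes the point more directly by printing each byte as an unsigned hex value instead of handing it to %c: a 2-byte character prints two bytes, and a character that needs four bytes in UTF-8 prints four.

#include <stdio.h>

int main(void)
{
    char a[] = "Ƃ";

    /* Print each byte of the string as an unsigned hex value, so the
     * multi-byte encoding is visible instead of being rendered as '?'. */
    int i;
    for (i = 0; a[i] != '\0'; ++i)
    {
        printf("0x%02X ", (unsigned char)a[i]);
    }
    printf("\n");

    return 0;
}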

Solution

  • UTF-8 is a mechanism for encoding Unicode character sequences as byte sequences, but it is not the only one. Unicode does not imply UTF-8, and, technically, UTF-8 does not imply Unicode either.

    When I insert a Unicode string, it breaks down into 4 characters (UTF-8)

    That is a function of how you store the string data, and

    • it sounds broken
    • it is probably not using UTF-8, contrary to your assertion

    So all I want is that when a Unicode string gets inserted, it is inserted as a unique combination that makes it possible to search for that string unambiguously.

    That's relatively easy: encode all your strings the same way. Encoding all of them in UTF-8 would be my choice, but you may also use any other stateless encoding that supports all the characters that may appear in your strings, such as UTF-16 or UTF-32. But you must use a consistent encoding for all characters of all strings. (A minimal sketch of byte-wise UTF-8 insertion appears at the end of this answer.)

    Having done that properly, you don't necessarily need to do anything else special to make your trie work.* If you choose UTF-16 or UTF-32, however, then I would suggest structuring the trie around the size of their code units (16 or 32 bits, respectively). That's not necessary, but it will likely yield advantages in the form of shallower and therefore better-performing tries.


    * Note, however, that many UTF-16 and UTF-32 code units contain bytes with value 0, such as 0x0031 and 0x00000200. If you treat these as byte sequences instead of code-unit sequences, then you must account for that. In particular, you must avoid assuming that individual null bytes serve as terminators.
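
    To make the first recommendation concrete, here is a minimal sketch of byte-wise insertion into a trie shaped like the one in the question (node *next[256]); the type and function names are illustrative only, not taken from any particular library. The essential point is that key bytes are read as unsigned char, so every UTF-8 byte, including those with the high bit set, maps to an index in 0..255.

    #include <stdlib.h>

    struct node {
        struct node *next[256];   /* one child link per possible byte value */
        void        *value;       /* non-NULL marks the end of a stored key */
    };

    /* Insert a key, byte by byte.  A multi-byte UTF-8 character simply
     * becomes a short fixed run of byte-labelled edges; a lookup that
     * follows the same bytes finds the same entry, and nothing else does. */
    static struct node *trie_insert(struct node *root, const char *key, void *value)
    {
        const unsigned char *p = (const unsigned char *)key;
        struct node *cur = root;

        for (; *p != '\0'; ++p) {
            if (cur->next[*p] == NULL) {
                cur->next[*p] = calloc(1, sizeof *cur->next[*p]);  /* zeroed node */
                if (cur->next[*p] == NULL)
                    return NULL;                                   /* out of memory */
            }
            cur = cur->next[*p];
        }
        cur->value = value;
        return cur;
    }

    Note that every byte of a multi-byte UTF-8 sequence has a value of 0x80 or above, so those edges can never coincide with the bytes of a plain ASCII key; the “wabcdyz” collision described in the question cannot occur when both keys are stored as UTF-8.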
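
    Returning to the footnote: if you do store UTF-16 (or UTF-32) keys, walk them by code unit with an explicit length rather than by scanning for a terminator. A short sketch follows, using uint16_t and an assumed callback; the names are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    /* Visit each 16-bit code unit of a UTF-16 key.  The count is passed
     * explicitly because, viewed as raw bytes, a unit such as 0x0031 ('1')
     * contains a zero byte, so byte-wise NUL scanning would cut the key short. */
    static void visit_utf16_key(const uint16_t *units, size_t count,
                                void (*visit)(uint16_t unit))
    {
        size_t i;
        for (i = 0; i < count; ++i)
            visit(units[i]);
    }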