I am making a library that lets users insert and search key-value pairs in a trie data structure. When I insert a Unicode string, it breaks down into 4 characters (UTF-8), which is okay, but each character becomes '?'. So I tried setlocale(LC_ALL, ""), which didn't work (or maybe I just don't know the right arguments for my case, or where to call it). I don't really care about printing or reading the character as it is. All I want is that it can somehow be represented uniquely.
In my trie there are links like node *next[256].
So all I want is that when a Unicode string is inserted, it is stored as a unique combination of bytes that makes it possible to search for that exact string. I also want a way to detect that a Unicode character was broken down into individual chars. That's because, e.g., if in a string "wxyz" the Unicode character "x" is broken down into a,b,c,d, then the trie would store "wabcdyz". But if I then searched for the plain (non-Unicode) string "wabcdyz", the trie would find an entry for it, and that would be a false match.
Here is a program that shows the Unicode character being broken down into four ? characters:
#include <stdio.h>

int main(void)
{
    printf("Hello World");

    char a[] = "Ƃ";
    int i;

    /* Print the string one char (byte) at a time. */
    for (i = 0; a[i] != '\0'; ++i) {
        printf("%c", a[i]);
    }
    return 0;
}
UTF-8 is a mechanism for encoding Unicode character sequences as byte sequences, but it is not the only one. Unicode does not imply UTF-8, and, technically, UTF-8 does not imply Unicode, either.
When I insert a unicode string, it breaks down into 4 characters(utf-8)
That is a function of how you store the string data and of how those bytes are later interpreted.
So all I want is when a unicode string gets inserted, it gets inserted as a unique combination which would make it possible to search that string uniquely.
That's relatively easy: encode all your strings the same way. Encoding all of them in UTF-8 would be my choice, but you may also use any other stateless encoding that supports all the characters that may appear in your strings, such as UTF-16 or UTF-32. But you must use a consistent encoding for all characters of all strings.
Having done that properly, you don't necessarily need to do anything else special to make your trie work.* If you choose UTF-16 or UTF-32, however, then I would suggest structuring the trie around the size of their code units (16 or 32 bits, respectively). That's not necessary, but it will likely yield shallower and therefore better-performing tries.
* Note, however, that many UTF-16 and UTF-32 code units contain bytes with value 0, such as 0x0031 and 0x00000200. If you treat these as byte sequences instead of code-unit sequences, then you must account for that. In particular, you must avoid assuming that individual null bytes serve as terminators.