When I use set_wordbreaks_utf32() from the libunibreak library to navigate through words, I see that single-letter words (e.g. 'a' in English, '北' in Chinese, ...) disappear: their position in the breaks array always evaluates to WORDBREAK_BREAK, so they are indistinguishable from the surrounding whitespace. The following code demonstrates the issue:
#include <stdio.h>
#include <stdint.h>
#include "wordbreak.h"

int main(int argc, const char* argv[]) {
    size_t i;
    uint32_t text[] = { 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ',
                        't', 'e', 's', 't', '.', '\n' };
    char breaks[1024];
    size_t length = sizeof(text) / sizeof(text[0]);

    /* breaks[i] describes the word-break opportunity after text[i] */
    set_wordbreaks_utf32(text, length, "", breaks);

    for (i = 0; i < length; i++) putchar(text[i]);          /* the text itself  */
    for (i = 0; i < length; i++) putchar(breaks[i] + '0');  /* the break values */
    putchar('\n');
    return 0;
}
The output of this code shows clearly that the letter 'a' is indistinguishable from the surrounding whitespace:
This is a test.
1110010000111000
What can I do to ensure that the boundaries of single-letter words are distinguishable in the set_wordbreaks_utf32() output?
[Apologies for using the line-breaks tag, but the word-break tag is related to a CSS property.]
Unicode Standard Annex #29 isn't really designed for that. What set_wordbreaks_utf32() does is find each word boundary:
This is a test.
1110010000111000
  T h i s ' ' i s ' ' a ' ' t e s t . '\n'
| _ _ _ |  |  _ |  |  |  |  _ _ _ | |  |
Each | above is a word boundary and each _ a position with no boundary; note that there is an implicit word boundary at the beginning of the string, shown by the leading |. Knowing the boundaries helps to find words, but it is not the complete solution. A complete word-detection algorithm still has to determine whether the characters between each pair of adjacent word boundaries include a Unicode letter, and mark that segment as a word accordingly.
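For illustration, here is a minimal sketch of that detection step, built on the same input as the question. The is_word_char() helper is a placeholder introduced here, not part of libunibreak; it only recognizes ASCII letters, so real code would substitute a proper Unicode property test (for example ICU's u_isalpha()).

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include "wordbreak.h"

/* Placeholder: decide whether a code point counts as a word character.
   ASCII letters only -- replace with a real Unicode letter/digit test. */
static int is_word_char(uint32_t c) {
    return c < 0x80 && isalpha((int)c);
}

int main(void) {
    uint32_t text[] = { 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ',
                        't', 'e', 's', 't', '.', '\n' };
    size_t length = sizeof(text) / sizeof(text[0]);
    char breaks[1024];
    size_t i, j, start = 0;

    set_wordbreaks_utf32(text, length, "", breaks);

    /* A WORDBREAK_BREAK at position i closes the segment that began at the
       previous boundary; the segment counts as a word only if it contains a
       word character, which keeps the single-letter 'a' and drops the plain
       whitespace segments. */
    for (i = 0; i < length; i++) {
        if (breaks[i] == WORDBREAK_BREAK) {
            int found = 0;
            for (j = start; j <= i; j++)
                if (is_word_char(text[j]))
                    found = 1;
            if (found) {
                for (j = start; j <= i; j++)
                    putchar((int)text[j]);
                putchar('\n');
            }
            start = i + 1;
        }
    }
    return 0;
}

With the question's input this prints This, is, a and test on separate lines, so the single-letter word is no longer lost among the surrounding break positions.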