Tags: c, unicode, line-breaks

Cannot distinguish single-character words with libunibreak


When I use set_word_breaks_utf32() from the libunibreak library to navigate through words, single-letter words (e.g. 'a' in English, '北' in Chinese) seem to disappear: they always evaluate to WORDBREAK_BREAK, just like the whitespace around them, so I cannot tell the two apart. The following code demonstrates the issue:

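#include <stdio.h>
#include <stdint.h>   /* uint32_t */
#include "wordbreak.h"

int main(int argc, const char* argv[]) {
    size_t i;
    /* UTF-32 code points of the sample sentence */
    uint32_t text[] = { 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', '.', '\n' };
    char breaks[1024];
    size_t length = sizeof(text) / sizeof(text[0]);
    /* Fill breaks[i] with the word-break status after text[i] */
    set_word_breaks_utf32(text, length, "", breaks);
    for (i = 0; i < length; i++) putchar((int)text[i]);
    /* Print each status as a digit (WORDBREAK_BREAK shows up as 0) */
    for (i = 0; i < length; i++) putchar(breaks[i] + '0');
    putchar('\n');
    return 0;
}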

The output shows that the letter 'a' gets the same value as the surrounding whitespace:

This is a test.
1110010000111000

What can I do to make the boundaries of single-letter words distinguishable in the output of set_word_breaks_utf32()?

[Apologies for using the line-breaks tag, but the word-break tag is related to a CSS property.]


Solution

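  • Unicode Standard Annex #29 isn't really designed for that. What set_word_breaks_utf32() does is find each word boundary: a 0 (WORDBREAK_BREAK) at position i means there is a boundary right after character i. It does not say whether the text between two boundaries is a word.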

    This is a test.
    1110010000111000
    
      T   h   i   s  ' '  i   s  ' '  a  ' '  t   e   s   t   .  '\n'
    |   _   _   _   |   |   _   |   |   |   |   _   _   _   |   |    |
    

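    Each | above is a word boundary (each _ is a position with no boundary). Boundaries help you find words, but they are not the complete solution: the space before 'a' and the 'a' itself are both one-character segments delimited by boundaries, which is why they look identical in the output. Note that there is an implicit word boundary at the beginning of the string. A complete word detection algorithm has to look at the characters between each pair of adjacent boundaries, check whether they include a Unicode letter, and treat the segment as a word only if they do.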