
Ncurses not writing out the specified number of wide characters (about column needs of a wide character)


In the program below I am attempting to use ncurses to output ten rows of ten Unicode characters each. Each iteration of the loop chooses one random character from an array of three Unicode characters. However, ncurses does not always write ten characters per row. It's kind of hard to explain, but if you run the program you will see empty spaces here and there: some rows contain ten characters, some only nine, some only eight. At this point I have no clue what I'm doing wrong.

I am running this program on an Ubuntu 20.04.1 machine, using the default GUI terminal.

#define _XOPEN_SOURCE_EXTENDED 1
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <ncurses.h>

#include <locale.h>
#include <time.h>

#define ITERATIONS 3000
#define REFRESH_DELAY 720000L
#define MAXX 10
#define MAXY 10
#define RANDOM_KANA &katakana[(rand()%3)]
#define SAME_KANA &katakana[2]

void show();

cchar_t katakana[3];
cchar_t kana1;
cchar_t kana2;
cchar_t kana3;

int main() {
  setlocale(LC_ALL, "");
  srand(time(0));

  setcchar(&kana1, L"\u30d0", WA_NORMAL, 5, NULL);
  setcchar(&kana2, L"\u30a6", WA_NORMAL, 4, NULL);
  setcchar(&kana3, L"\u30b3", WA_NORMAL, 4, NULL);
  katakana[0] = kana1;
  katakana[1] = kana2;
  katakana[2] = kana3;
  
  initscr();
  for (int i=0; i < ITERATIONS; i++) {
    show();
    usleep(REFRESH_DELAY);
  }
}

void show() {
  for (int x=0; x < MAXX; x++) {
    for (int y = 0; y < MAXY; y++) {
      mvadd_wch(y, x, RANDOM_KANA);
    }
  }
  refresh();
  //getch();
}


Solution

  • TL;DR: The basic problem is that katakana (and many other Unicode characters) are what are often called "double-width characters" because they occupy two columns in a monospaced terminal font.

    So if you place バ in column 0 of a display, you need to place the next character at column 2, not column 1. That's not what you're doing; you're attempting to place the next character at column 1, partly overlapping the バ, and that's undefined behaviour both from the perspective of the ncurses library and the terminal emulator being used for display.

    So you should change the line

          mvadd_wch(y, x, RANDOM_KANA);
    

    to

          mvadd_wch(y, 2*x, RANDOM_KANA);
    

    to take into account the fact that the katakanas occupy two columns. That will tell ncurses to put every character at the column it is supposed to be at, which avoids the overlap problem. If you do that, your screens display as neat 10x10 matrices.

    Note that this usage of "width" (that is, the number of columns the displayed character occupies) has very little to do with the C concept of "wide characters" (wchar_t), which concerns how many bytes it takes to store the character. Non-English Latin alphabet characters and characters in the Greek, Cyrillic, Arabic, Hebrew and other alphabets are displayed in a single column but must be stored in a wchar_t or a multibyte encoding.
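
    To make the storage/display distinction concrete, here is a minimal sketch that prints both numbers side by side. The names utf8_length, report and storage_vs_width_demo are my own hypothetical helpers, not part of any library, and the column counts come from wcwidth, so they depend on the locale the program runs in (they may be -1 outside a UTF-8 locale).

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> */
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes UTF-8 needs to store one code point.
   Returns 0 for values that are not valid Unicode scalar values. */
int utf8_length(unsigned long cp) {
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates cannot be encoded */
    if (cp < 0x10000)
        return 3;
    if (cp < 0x110000)
        return 4;
    return 0;
}

/* Print storage size next to display width for one character.
   The "columns" value is whatever wcwidth() reports in the current locale. */
void report(const char *name, wchar_t wc) {
    assert(name != NULL);
    printf("%-12s U+%04lX  utf8 bytes: %d  sizeof(wchar_t): %zu  columns: %d\n",
           name, (unsigned long)wc, utf8_length((unsigned long)wc),
           sizeof(wchar_t), wcwidth(wc));
}

/* Hypothetical driver: call this from main() to see the comparison. */
void storage_vs_width_demo(void) {
    setlocale(LC_ALL, "");      /* use the user's locale, as the question does */
    report("latin a", L'a');
    report("greek alpha", (wchar_t)0x03B1);
    report("katakana ba", (wchar_t)0x30D0);
}
```

    In a UTF-8 locale I would expect the Greek letter to need more than one byte yet report one column, while the katakana reports two columns; the byte count and the column count are simply unrelated quantities.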

    Keep that distinction in mind when you read the longer answer, below.

    Also, calling these characters "double-width" is Eurocentric; in terms of the Asian writing systems (and the Unicode standard), East Asian characters (including emoji) are classified as either "halfwidth" or "fullwidth" (or "normal width"), since the normal characters are the (visually) wide ones.


    The problem is certainly as you describe, although the details depend on the terminal. Unfortunately, it doesn't seem possible to illustrate the problem without a screenshot, so I'm including one. This is what it looks like in two of the terminal emulators I happened to have kicking around; the console is shown after the second screen (since, as we'll see, the first screen always displays as expected). On the left is KDE's Konsole; on the right, gnome-terminal. Most terminal emulators are more similar to gnome-terminal, but not all.

    [Screenshot: two terminal emulators showing misplaced characters]

    In both cases you can see the ragged right margin, but there is a difference: on the left there are ten characters in every row but some of them seem misplaced. On some lines, a character is overlapping the previous character, shifting the line over. On the right, the overlapped characters are not displayed, so some of the lines have fewer than ten characters. But the characters which are displayed on those lines show the same half-character shifts.

    The problem here is that the katakanas are all "double-width" characters; that is, they take up two adjacent terminal cells. I left my prompt in the screenshots (something I very rarely do) so you can see how the katakanas occupy the same space as two Latin characters.

    Now, you are using mvadd_wch to display each character at a screen co-ordinate you provide. But most of the screen coordinates you provide are impossible because they force double-width characters to overlap. For example, you place the first character on each line in column 0; it occupies columns 0 and 1 (because it is double-width). You then place the next character on column 1 of the same line, overlapping the first character.

    That's undefined behaviour. What actually happens on the first screen is probably OK in most applications: since ncurses doesn't try to back the output up by half a double-width character, each character ends up being output right after the previous character on the same line, so on the first screen the katakanas line up perfectly, each of them taking two columns. So the visuals are fine, but there is an underlying problem: ncurses records the katakanas as being in columns 0, 1, 2, 3, ..., but the characters are actually in columns 0, 2, 4, 6, ...
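
    The mismatch on the first screen can be modelled with a few lines of arithmetic. This is only a toy model of the behaviour described above, not ncurses code; recorded_column, actual_column and print_first_screen_row are hypothetical names I made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Column the question's code asks for (and ncurses records) for character
   number `index` on a row: mvadd_wch(y, x, ...) with x == index. */
int recorded_column(int index) {
    assert(index >= 0);
    return index;
}

/* Column where the terminal really draws character number `index` when the
   characters are emitted back to back and each is `width` columns wide. */
int actual_column(int index, int width) {
    assert(index >= 0 && width > 0);
    return index * width;
}

/* Print both columns for the first `count` characters of a row. */
void print_first_screen_row(int count, int width) {
    for (int i = 0; i < count; i++)
        printf("char %d: recorded column %d, drawn at column %d\n",
               i, recorded_column(i), actual_column(i, width));
}
```

    Calling print_first_screen_row(10, 2) lists the ten katakana of one row; only character 0 has matching columns, and the gap grows by one for every further character.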

    When you start overwriting the first screen with the next 10x10 block, this problem becomes visible. ncurses records which character is at each row and column, which lets it optimise mvadd_wch by not redisplaying characters which haven't changed. That happens occasionally in your random blocks, and frequently in most ncurses applications. But of course, although it doesn't have to display a character which is already displayed, it does have to place the next character at the column it's supposed to occupy. So it needs to output a cursor move code. But since characters are not actually displayed at the columns ncurses thinks they're at, it doesn't compute the correct move code.

    Take the second line as an example: ncurses has determined that there is no need to redraw the character at column 0, because it hasn't changed. However, the character you've asked it to display at column 1 has changed. So ncurses outputs a "move right one character" console code in order to write the second character at column 1, overlapping both the character which was previously at column 0 and the character previously at column 2. As the screenshot shows, Konsole attempts to show the overlap, and gnome-terminal erases the overlapped character. (It's undefined behaviour to overlap characters, so either of these is reasonable.) Both of them then show the second character at column 1.
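
    The "only redraw what changed" idea can be sketched as a comparison of two rows. This is a deliberately simplified model and certainly not how ncurses is implemented internally; changed_cells and print_redraw_plan are hypothetical names, and a row is assumed to be a plain array of character codes, one per logical cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Store in `out` the indices of the cells whose content differs between the
   old and the new row, and return how many such cells there are.
   `out` must have room for `n` entries. */
size_t changed_cells(const unsigned *old_row, const unsigned *new_row,
                     size_t n, size_t *out) {
    assert(old_row != NULL && new_row != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (old_row[i] != new_row[i])
            out[count++] = i;
    }
    return count;
}

/* Print the moves a diff-based redraw would make: unchanged cells are
   skipped, so each write must be preceded by a cursor move. */
void print_redraw_plan(const unsigned *old_row, const unsigned *new_row,
                       size_t n) {
    size_t idx[64];
    assert(n <= 64);
    size_t k = changed_cells(old_row, new_row, n, idx);
    for (size_t i = 0; i < k; i++)
        printf("move to recorded column %zu, write U+%04X\n",
               idx[i], new_row[idx[i]]);
}
```

    With an old row of U+30D0, U+30A6, U+30B3 and a new row of U+30D0, U+30B3, U+30B3, only cell 1 is reported as changed, which is the situation in the second-line example: the cursor is moved to recorded column 1, even though the character from the first screen really starts at column 2.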

    OK, that's the long and possibly confusing explanation.

    And the immediate solution is at the beginning of this answer. But it may well not be a complete solution, because this is probably a highly simplified version of your final program. It's quite likely that your real program will need to compute column numbers in a less simplistic way. You'll need to be aware of the actual column widths of each character you output, and use that information to compute the correct placements.

    It's possible that you just know how wide each character is. (For example, if all characters are katakana, or all characters are Latin, it's easy.) But it's often the case that you don't know for certain, so you might find it useful to ask the C library to tell you how many columns each character takes. You can do that with the wcwidth function. (See the link for details, or try man wcwidth at your console.)
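
    As a rough sketch of that approach, the function below walks a wide string and accumulates wcwidth results to find the starting column of each character. layout_columns and print_layout are hypothetical helpers of mine, and the sketch simply gives up when wcwidth returns -1 instead of guessing.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> */
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Fill cols[i] with the starting column of text[i], beginning at start_col.
   Returns the column just past the last character, or -1 as soon as
   wcwidth() reports a character it cannot measure. */
int layout_columns(const wchar_t *text, size_t n, int start_col, int *cols) {
    assert(text != NULL && cols != NULL);
    int col = start_col;
    for (size_t i = 0; i < n; i++) {
        int w = wcwidth(text[i]);
        if (w < 0)
            return -1;          /* unknown to the current locale */
        cols[i] = col;
        col += w;               /* 0, 1 or 2 columns in Unicode locales */
    }
    return col;
}

/* Print the computed starting column of each character (up to 64). */
void print_layout(const wchar_t *text, size_t n) {
    int cols[64];
    assert(n <= 64);
    setlocale(LC_ALL, "");      /* wcwidth() answers for the current locale */
    if (layout_columns(text, n, 0, cols) < 0) {
        puts("at least one character has no known width in this locale");
        return;
    }
    for (size_t i = 0; i < n; i++)
        printf("U+%04lX starts at column %d\n",
               (unsigned long)text[i], cols[i]);
}
```

    Each cols[i] is the value you would pass as the column argument of mvadd_wch for the corresponding character, in place of a fixed 2*x.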

    But there's a big caveat here: wcwidth will tell you the width of the character as stored in the current locale. In Unicode locales, the result will always be 0, 1 or 2 for characters included in the locale, and -1 for character codes which don't correspond to the characters for which the locale has information. 0 is used for most combining accents as well as control characters which don't move the cursor, and 2 is used for East Asian fullwidth characters.

    That's all fine, but the C library doesn't consult with the terminal emulator. (There's no way to do that, since the terminal emulator is a different program; indeed, it might not even be on the same computer.) So the library must assume that you've configured the terminal emulator with the same information as you used to configure the locale. (I know that's a bit unfair. "You" probably did no more than install a Linux distro, and all of the configurations were done by the various hackers who put together the software gathered into the distribution. They also didn't coordinate with each other.)

    Most of the time this works. But there are always a few characters whose widths are not configured correctly. Usually, this is because the character is in the font being used by the terminal emulator, but is not considered a valid character by the locale; wcwidth then returns -1 and the caller needs to guess which width to use. Incorrect guesses create problems similar to the one discussed in this answer. So you may run into the occasional glitch.
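
    One way to handle the -1 case is to make the guess explicit, so it's at least obvious where the guessing happens. A minimal sketch, where width_or_guess is a hypothetical wrapper and the guess value is entirely up to the caller:

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> */
#include <assert.h>
#include <wchar.h>

/* Return the width wcwidth() reports for wc in the current locale, or the
   caller-supplied guess when wcwidth() returns -1. The guess can be wrong:
   the terminal emulator, not this function, decides what is displayed. */
int width_or_guess(wchar_t wc, int guess) {
    assert(guess >= 0);
    int w = wcwidth(wc);
    return w >= 0 ? w : guess;
}
```

    For instance, width_or_guess(wc, 1) assumes that anything the locale doesn't know about is a single column wide; if the terminal disagrees, you get exactly the kind of misplacement described in this answer.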

    If you do (or even if you just want to explore your locale a bit), you can use the tools and techniques from this earlier SO answer.

    Finally, since Unicode 9, there is a control character which can force the following character to be fullwidth, in addition to other contextual rules which can change the rendering of a character. So it's no longer even possible to determine the column width of a character without looking at the context and understanding a lot more than you want to know about Unicode East Asian width rules. This makes wcwidth even less general than it used to be.