
How is a multibyte string converted to a wide-character string in fxprintf.c in glibc?


Currently, the logic of perror in the glibc source is as follows:

If stderr is already oriented, use it as is; otherwise dup() its file descriptor and print the message to a new stream opened on the duplicate.

If stderr is wide-oriented, the following logic from stdio-common/fxprintf.c is used:

size_t len = strlen (fmt) + 1;
wchar_t wfmt[len];
for (size_t i = 0; i < len; ++i)
  {
    assert (isascii (fmt[i]));
    wfmt[i] = fmt[i];
  }
res = __vfwprintf (fp, wfmt, ap);

The format string is converted to wide-character form by the following code, which I do not understand:

wfmt[i] = fmt[i];

It also asserts isascii on every character:

assert (isascii(fmt[i]));

But the format string is not always ASCII in wide-character programs, because we may use a UTF-8 format string, which can contain non-7-bit values. Why is there no assertion failure when we run the following code (assuming a UTF-8 locale and a UTF-8 source encoding)?

#include <stdio.h>
#include <errno.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  fwide(stderr, 1);
  errno = EINVAL;
  perror("привет мир");  /* note that the string is multibyte */
  return 0;
}
$ ./a.out 
привет мир: Invalid argument

Can we use dup() on a wide-oriented stderr to get a stream that is not wide-oriented? In that case the code could be rewritten without this mysterious conversion, given that perror() takes only multibyte strings (const char *s) and locale messages are all multibyte anyway.

Turns out we can. The following code demonstrates this:

#include <stdio.h>
#include <wchar.h>
#include <unistd.h>
int main(void)
{
  fwide(stdout,1);
  FILE *fp;
  int fd = -1;
  if ((fd = fileno (stdout)) == -1) return 1;
  if ((fd = dup (fd)) == -1) return 1;
  if ((fp = fdopen (fd, "w+")) == NULL) return 1;
  wprintf(L"stdout: %d, dup: %d\n", fwide(stdout, 0), fwide(fp, 0));
  return 0;
}
$ ./a.out 
stdout: 1, dup: 0

BTW, is it worth posting an issue about this improvement to glibc developers?


NOTE

Using dup() has limitations with respect to buffering. I wonder whether the implementation of perror() in glibc takes this into account. The following example demonstrates the issue: the output appears not in the order in which it was written to the streams, but in the order in which the buffers are flushed. The output of fprintf is flushed first (because of the "\n"), and the output of fwprintf is flushed only when the program exits.

#include <wchar.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
  wint_t wc = L'b';
  fwprintf(stdout, L"%lc", wc);

  /* --- */

  FILE *fp;
  int fd = -1;
  if ((fd = fileno (stdout)) == -1) return 1;
  if ((fd = dup (fd)) == -1) return 1;
  if ((fp = fdopen (fd, "w+")) == NULL) return 1;

  char c = 'h';
  fprintf(fp, "%c\n", c);
  return 0;
}
$ ./a.out 
h
b

But if we also end the fwprintf output with \n, the output appears in program order:

$ ./a.out 
b
h

perror() manages to get away with that, because in GNU libc stderr is unbuffered. But will it work safely in programs where stderr is manually set to buffered mode?


This is the patch that I would propose to glibc developers:

diff -urN glibc-2.24.orig/stdio-common/perror.c glibc-2.24/stdio-common/perror.c
--- glibc-2.24.orig/stdio-common/perror.c   2016-08-02 09:01:36.000000000 +0700
+++ glibc-2.24/stdio-common/perror.c    2016-10-10 16:46:03.814756394 +0700
@@ -36,7 +36,7 @@

   errstring = __strerror_r (errnum, buf, sizeof buf);

-  (void) __fxprintf (fp, "%s%s%s\n", s, colon, errstring);
+  (void) _IO_fprintf (fp, "%s%s%s\n", s, colon, errstring);
 }


@@ -55,7 +55,7 @@
      of the stream.  What is supposed to happen when the stream isn't
      oriented yet?  In this case we'll create a new stream which is
      using the same underlying file descriptor.  */
-  if (__builtin_expect (_IO_fwide (stderr, 0) != 0, 1)
+  if (__builtin_expect (_IO_fwide (stderr, 0) < 0, 1)
       || (fd = __fileno (stderr)) == -1
       || (fd = __dup (fd)) == -1
       || (fp = fdopen (fd, "w+")) == NULL)

Solution

  • NOTE: It wasn't easy to find concrete questions in this post; on the whole, the post seems to be an attempt to engage in a discussion about implementation details of glibc, which it seems to me would be better directed to a forum specifically oriented to development of that library such as the libc-alpha mailing list. (Or see https://www.gnu.org/software/libc/development.html for other options.) This sort of discussion is not really a good match for StackOverflow, IMHO. Nonetheless, I tried to answer the questions I could find.

    1. How does wfmt[i] = fmt[i]; convert from multibyte to wide character?

      Actually, the code is:

      assert(isascii(fmt[i]));
      wfmt[i] = fmt[i];
      

      which is based on the fact that the numeric value of an ASCII character is the same as its value as a wchar_t. Strictly speaking, this need not be the case. The C standard specifies:

      Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__. (§7.19/2)

      (gcc does not define that symbol.)

      However, that only applies to characters in the basic character set, not to all characters recognized by isascii. The basic character set contains 91 graphic ASCII characters (all of them except $, @ and `) as well as space, newline, horizontal tab, vertical tab and form feed; the basic execution character set additionally includes alert, backspace, carriage return and the null character. So it is theoretically possible that one of the remaining control characters will not be converted correctly. However, the actual format string used in the call to __fxprintf only contains characters from the basic character set, so in practice this pedantic detail is not important.

    2. Why is there no assertion failure when we execute perror("привет мир");?

      Because only the format string is being converted, and the format string (which is "%s%s%s\n") contains only ASCII characters. Since the format string contains %s (and not %ls), the argument is expected to be char* (and not wchar_t*) in both the narrow- and wide-character orientations.

    3. Can we use dup() on wide-oriented stderr to make it not wide-oriented?

      That would not be a good idea. First, if the stream has an orientation, it might also have a non-empty internal buffer. Since that buffer belongs to the stdio library and not to the underlying POSIX fd, it will not be shared with the duplicate fd. So the message printed by perror might end up in the middle of some existing output. In addition, it is possible that the multibyte encoding has shift states, and that the output stream is not currently in the initial shift state. In that case, outputting an ASCII sequence could result in garbled output.

      In the actual implementation, the dup is only performed on streams without orientation; these streams have never had any output directed at them, so they are definitely still in the initial shift state with an empty buffer (if the stream is buffered).

    4. Is it worth posting an issue about this improvement to glibc developers?

      That is up to you, but don't do it here. The normal way of doing that would be to file a bug. There is no reason to believe that glibc developers read SO questions, and even if they do, someone would have to copy the issue to a bug, and also copy any proposed patch.