Tags: c, utf-8, printf, glibc

Workaround for glibc's printf truncation bug in multi-byte locales?


Certain GNU-based distros (notably Debian) are still affected by a bug in GNU libc that causes the printf family of functions to return a bogus -1 when the specified precision would truncate a multi-byte character. The bug was fixed in 2.17 and backported to 2.16, but Debian has an archived bug for it and the maintainers appear to have no intention of backporting the fix to the 2.13 used by Wheezy.

The text below is quoted from https://sourceware.org/bugzilla/show_bug.cgi?id=6530.

Here's a simpler testcase for this bug courtesy of Jonathan Nieder:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    int n;

    setlocale(LC_CTYPE, "");
    n = printf("%.11s\n", "Author: \277");
    perror("printf");
    fprintf(stderr, "return value: %d\n", n);
    return 0;
}

Under a C locale that'll do the right thing:

$ LANG=C ./test
Author: �
printf: Success
return value: 10

But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:

$ LANG=en_US.utf8 ./test
printf: Invalid or incomplete multibyte or wide character

It's worth noting that sprintf will also overwrite the first character of the output array with \0 in this context.

I am currently trying to retrofit a MUD codebase to support UTF-8, and unfortunately the code is riddled with cases where arbitrary sprintf precision is used to limit how much text is sent to output buffers. The problem is made much worse by the fact that most programmers don't expect a -1 return in this context, which can result in uninitialized memory reads and badness that cascades from there. (I've already caught a few such cases in Valgrind.)
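
For illustration only (this is not code from the actual codebase), the pattern that breaks looks roughly like this; one bogus -1 poisons the running offset and everything appended after it:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    char out[4096];
    int len = 0;

    setlocale(LC_CTYPE, "");

    /* Same conversion as the quoted testcase: under a UTF-8 locale an
     * affected glibc returns -1 here instead of the number of bytes written. */
    len += sprintf(out + len, "%.11s", "Author: \277");
    fprintf(stderr, "len after append: %d\n", len);

    /* len is now -1, so a further sprintf(out + len, ...) would write before
     * the start of the buffer, and the length eventually handed to the socket
     * write would be nonsense. */
    return 0;
}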

Has anyone come up with a concise workaround for this bug that doesn't involve rewriting every single invocation of a format string that uses arbitrary-length precision? I'm fine with truncated UTF-8 characters being written to my output buffer, as it's fairly trivial to clean that up in my output processing prior to the socket write, and it seems like overkill to invest this much effort in a problem that will eventually go away given a few more years.
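
For what it's worth, the cleanup step I have in mind is small. Here is a minimal sketch (utf8_trim_partial is a made-up name, not from any library) that drops an incomplete trailing sequence from a buffer before it goes to the socket:

#include <stddef.h>

/* If the buffer ends in an incomplete UTF-8 sequence, return a shortened
 * length that cuts it off; otherwise return len unchanged.  This is only a
 * sketch and does not attempt full validation. */
static size_t utf8_trim_partial(const char *buf, size_t len)
{
    size_t i = len;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && ((unsigned char)buf[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return len;                        /* empty, or nothing but continuation bytes */

    unsigned char lead = (unsigned char)buf[i - 1];
    size_t need;

    if (lead < 0x80)        need = 1;      /* ASCII byte */
    else if (lead >= 0xF0)  need = 4;      /* 4-byte lead */
    else if (lead >= 0xE0)  need = 3;      /* 3-byte lead */
    else if (lead >= 0xC0)  need = 2;      /* 2-byte lead */
    else                    return len;    /* stray continuation byte; leave as-is */

    /* Fewer bytes present than the lead byte promises: drop the partial char. */
    return (len - (i - 1) < need) ? i - 1 : len;
}

The output path would call this on the assembled buffer right before the socket write, so a character split by a byte-counting precision gets dropped rather than sent as mojibake.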


Solution

  • I'm guessing, and the comments on the question seem to confirm it, that you don't use much of the C library's locale-specific functionality. In that case you'd probably be better off not changing to a UTF-8 based locale at all, and leaving the program in the single-byte locale your code assumes.

    When you do need to process UTF-8 strings as UTF-8 strings, you can use specialized code. It's not too hard to write your own UTF-8 processing routines (a small sketch follows at the end of this answer), and you can even download the Unicode Character Database and do some fairly sophisticated character classification. If you'd prefer a third-party library to handle UTF-8 strings, there's ICU, as you mentioned in your comments; it's a pretty heavyweight library, though, and a previous question recommends a few lighter-weight alternatives.

    It might also be possible to switch the locale back and forth as necessary so you can still use the C library's functionality; a sketch of this follows below. You'll want to check the performance impact, however, as switching locales can be an expensive operation.
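
    As a sketch of the sort of self-written routine I mean (the name is made up), counting codepoints instead of bytes is only a few lines:

        #include <stddef.h>

        /* Count codepoints in a UTF-8 string by skipping continuation bytes
         * (0b10xxxxxx).  Good enough for column alignment in telnet output;
         * it does not validate the encoding. */
        static size_t utf8_strlen(const char *s)
        {
            size_t count = 0;
            for (; *s != '\0'; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    count++;
            return count;
        }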
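
    If you do go the switching route, glibc's per-thread locale objects (newlocale/uselocale) are likely cheaper than calling setlocale around every block. A rough sketch, assuming an en_US.utf8 locale is installed:

        #define _GNU_SOURCE             /* for the newlocale/uselocale declarations */
        #include <locale.h>
        #include <stdio.h>

        int main(void)
        {
            /* Build a UTF-8 locale object once at startup... */
            locale_t utf8 = newlocale(LC_CTYPE_MASK, "en_US.utf8", (locale_t)0);
            if (utf8 == (locale_t)0)
                return 1;

            /* The global locale stays "C", so %.Ns counts bytes and never
             * fails on invalid or truncated sequences. */
            printf("%.11s\n", "Author: \277");

            /* ...and switch this thread to it only around locale-sensitive calls. */
            locale_t old = uselocale(utf8);
            /* mbrtowc(), iswalpha(), etc. here */
            uselocale(old);

            freelocale(utf8);
            return 0;
        }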