c++ unicode fmt stdformat

Is std::format going to work with things like ICU UnicodeString?


Rather than a long preamble, here is my core question, up front. The paragraphs below explain in more detail.

Is there a template parameter in std::format (or fmt) that will allow me to format into ICU UnicodeStrings, or perhaps into something like char16_t[] or std::basic_string<char16_t>, while using a Unicode library to deal with things like encoding and grapheme clusters?

More Explanation, Background

I see the C++20 standard has the std::format library component for formatting strings. (It's late 2022 and I still can't use it with my compiler, clang from Xcode 14; I'm curious about the cause of the delay, but that's another question.)

I've been using this fmt library, which looks like a simpler preview of the official one.

int x = 10;
fmt::print("x is {}", x);

I've also been using ICU's UnicodeString class. It lets me correctly handle all languages and character types, from ASCII to Chinese characters to emojis.

I don't expect the fmt library to be aware of Unicode out of the box. That would require it to build and link with ICU, or something like it. Here's an example of how it isn't:

#include <cstring>     // strlen
#include <fmt/core.h>  // fmt::print

void testFormatUnicodeWidth() {
    // Two ways to write the Spanish word "está".
    const char *s1 = "est\u00E1";  // U+00E1 : Latin small letter A with acute
    const char *s2 = "esta\u0301"; // U+0301 : Combining acute accent
    // strlen counts UTF-8 bytes, not visible characters.
    fmt::print("s1 = {}, length = {}\n", s1, strlen(s1));
    fmt::print("s2 = {}, length = {}\n", s2, strlen(s2));
    fmt::print("|{:8}|\n", s1);
    fmt::print("|{:8}|\n", s2);
}

That prints:

s1 = está, length = 5
s2 = está, length = 6
|está    |
|está   |

To make that width specifier work the way I want, to look nice on the screen, I could use ICU's classes, which can iterate over the visible characters ("grapheme clusters") of a string.
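To make the mismatch concrete without pulling in ICU, here is a minimal sketch in plain C++. The helper names (count_code_points, approx_display_width, pad_right) are hypothetical, and the width helper only skips combining marks in U+0300..U+036F, which is a crude stand-in for real grapheme cluster segmentation of the kind ICU provides. It assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string: every byte that is not a
// continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: approximate on-screen width by skipping combining
// marks in U+0300..U+036F only. This is an assumption made for illustration;
// real grapheme segmentation covers far more cases.
std::size_t approx_display_width(const std::string& utf8) {
    std::size_t width = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) == 0x80) continue;  // continuation byte
        if (i + 1 < utf8.size()) {
            unsigned char d = static_cast<unsigned char>(utf8[i + 1]);
            // U+0300..U+036F encode as CC 80..BF or CD 80..AF in UTF-8.
            bool combining = (c == 0xCC && d >= 0x80 && d <= 0xBF) ||
                             (c == 0xCD && d >= 0x80 && d <= 0xAF);
            if (combining) continue;  // attaches to the previous character
        }
        ++width;
    }
    return width;
}

// Hypothetical helper: left-align text in a field of the given width,
// measuring with approx_display_width instead of bytes or code points.
std::string pad_right(const std::string& utf8, std::size_t width) {
    std::size_t w = approx_display_width(utf8);
    return w >= width ? utf8 : utf8 + std::string(width - w, ' ');
}
```

With these helpers, both spellings of "está" measure as 4 columns, so pad_right(s1, 8) and pad_right(s2, 8) each append four spaces, and the padded result can be passed to fmt::print with a plain {} placeholder.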

I don't expect std::format to require Unicode either. From what I can tell the C++ standard people create things that can run on small embedded devices. That's cool. But I'm asking if there will also be a way for me to integrate the two, so that I don't have a split world, between:

  1. C++'s strings and format.
  2. ICU strings if I want things to look right on screen.

Solution

  • {fmt} doesn't support ICU UnicodeString directly but you can easily write your own formatting function that does. For example:

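    #include <fmt/xchar.h>
    #include <unicode/unistr.h>
    #include <utility>  // std::forward
    
    // Note: to my knowledge ICU declares the wchar_t constructors of
    // UnicodeString only where wchar_t is 16 bits wide (e.g. Windows).
    template <typename... T>
    auto format(fmt::wformat_string<T...> fmt, T&&... args) -> icu::UnicodeString {
      auto s = fmt::format(fmt, std::forward<T>(args)...);
      // Explicit cast: UnicodeString takes an int32_t length, not size_t.
      return icu::UnicodeString(s.data(), static_cast<int32_t>(s.size()));
    }
    
    int main() {
      icu::UnicodeString s = format(L"The answer is {}.", 42);
    }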
    

    Note that {fmt} supports Unicode but width estimation works on code points (like Python's str.format) instead of grapheme clusters at the moment. It will be addressed in one of the future releases.
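For the char16_t part of the question, one possible route is to format into an ordinary UTF-8 std::string and convert the result afterwards. ICU offers UnicodeString::fromUTF8 for that conversion. The sketch below is a stand-alone illustration of the same idea, not ICU's or {fmt}'s implementation: utf8_to_utf16 is a hypothetical helper, and it assumes well-formed UTF-8 (malformed input is not diagnosed).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode as UTF-16.
// Assumes well-formed input; no validation is performed.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and its payload bits.
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A call such as utf8_to_utf16(fmt::format("x is {}", x)) would then yield a std::u16string. This only handles encoding; width and grapheme questions still need a Unicode library.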