Boost Spirit (X3) symbol tables resulting in UTF8 strings

I'm trying to parse LaTeX escape codes (e.g. \alpha) into the corresponding Unicode (Mathematical) characters (i.e. U+1D6FC).

Right now this means I am using this symbols parser (rule):

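struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
    // Populate the table in the constructor; each entry maps a name to one code point.
    greek_lower_case_letters_() { add("alpha", U'\u03B1'); }
} greek_lower_case_letter;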

This works fine, but it means I get a std::u32string as a result. I'd like to keep the Unicode code points in the code, both for possible future automation and for maintenance reasons. Is there a way to get this kind of parser to parse into a UTF-8 std::string?

I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization bla bla).

I was hoping there was some elegant way instead of going through a bunch of hoops to get this working (symbols appending strings to the result).

I do fear, though, that keeping the code point values while wanting UTF-8 output will incur a runtime conversion cost (or is a constexpr UTF-32 -> UTF-8 conversion possible?).


  • The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in utf8 encoding:

      auto push_utf8 = [](auto& ctx)
      {
         // Wrap the rule's std::string attribute in an iterator that
         // encodes each assigned code point as UTF-8.
         typedef std::back_insert_iterator<std::string> insert_iter;
         insert_iter out_iter(_val(ctx));
         boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
         *utf8_iter++ = _attr(ctx);
      };
      // ...
      auto const escape =
             ('u' > hex4)           [push_utf8]
         |   char_("\"\\/bfnrt")    [push_esc]
         ;

    This is used in their

    typedef x3::rule<unicode_string_class, std::string> unicode_string_type;

    As you can see, this builds the UTF-8 sequence into a std::string attribute.

    See for full code:
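
Separately from the linked example, here is a minimal Boost-free sketch of the encoding step itself, which also touches on the constexpr part of the question. The names utf8_bytes, to_utf8 and append_utf8 are hypothetical helpers made up for illustration; they are not part of Boost or of the cierelabs code. The sketch assumes C++14 or later (relaxed constexpr) and treats surrogates and values above U+10FFFF as not encodable.

```cpp
#include <string>

// Hypothetical helper type (not from Boost or the linked example):
// holds the UTF-8 encoding of a single code point.
struct utf8_bytes
{
    char     data[4];  // encoded bytes; only the first `size` are meaningful
    unsigned size;     // 1..4, or 0 if the code point is not encodable
};

// Encode one code point as UTF-8. Usable in constant expressions (C++14).
constexpr utf8_bytes to_utf8(char32_t cp)
{
    utf8_bytes r{{0, 0, 0, 0}, 0};
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        r.data[0] = static_cast<char>(cp);
        r.size = 1;
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        r.data[0] = static_cast<char>(0xC0 | (cp >> 6));
        r.data[1] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 2;
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return r;                      // surrogate: leave size at 0
        r.data[0] = static_cast<char>(0xE0 | (cp >> 12));
        r.data[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[2] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 3;
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        r.data[0] = static_cast<char>(0xF0 | (cp >> 18));
        r.data[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        r.data[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[3] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 4;
    }
    return r;                              // size stays 0 above U+10FFFF
}

// Runtime convenience: append the encoding of `cp` to a std::string.
inline void append_utf8(std::string& out, char32_t cp)
{
    utf8_bytes const b = to_utf8(cp);
    out.append(b.data, b.size);
}

// Evaluated at compile time: U+03B1 encodes as the two bytes CE B1.
static_assert(to_utf8(U'\u03B1').size == 2, "alpha is a two-byte sequence");
static_assert(to_utf8(U'\u03B1').data[0] == static_cast<char>(0xCE), "lead byte");
static_assert(to_utf8(U'\u03B1').data[1] == static_cast<char>(0xB1), "continuation byte");
```

With something like this, the symbol table could keep its char32_t values and a semantic action could call append_utf8(_val(ctx), _attr(ctx)) in place of the iterator-based push_utf8. That wiring is an untested assumption about how it would fit into an X3 rule with a std::string attribute, not something taken from the linked code.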