c++ boost c++14 boost-spirit boost-spirit-x3

Boost Spirit (X3) symbol tables resulting in UTF8 strings


I'm trying to parse LaTeX escape codes (e.g. \alpha) into the corresponding Unicode (mathematical) characters (e.g. U+1D6FC).

Right now this means I am using this symbols parser (rule):

struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
  greek_lower_case_letters_()
  {
    add("alpha",   U'\u03B1');
  }
} greek_lower_case_letter;

This works fine, but it means I get a std::u32string as the result. I'd like an elegant way to keep the Unicode code points in the code, both for maintenance and for possible future automation. Is there a way to get this kind of parser to produce a UTF-8 std::string instead?

I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization bla bla).

I was hoping there was some elegant way to do this, instead of jumping through a bunch of hoops to get it working (such as symbols appending strings to the result).

I do fear, though, that keeping the code point values while wanting UTF-8 output will incur a runtime conversion cost (or is a constexpr UTF-32 -> UTF-8 conversion possible?).


Solution

  • The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in UTF-8 encoding:

      auto push_utf8 = [](auto& ctx)
      {
         typedef std::back_insert_iterator<std::string> insert_iter;
         insert_iter out_iter(_val(ctx));
         boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
         *utf8_iter++ = _attr(ctx);
      };
    
      // ...
    
      auto const escape =
             ('u' > hex4)           [push_utf8]
         |   char_("\"\\/bfnrt")    [push_esc]
         ;
    

    This is used in their

    typedef x3::rule<unicode_string_class, std::string> unicode_string_type;
    

    This rule, as you can see, builds the UTF-8 sequence directly into a std::string attribute.

    See for full code: https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp
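On the question of a constexpr UTF-32 -> UTF-8 conversion: the encoding is only a few shifts and masks, so it can be written as a C++14 constexpr function. The following is a minimal sketch, not taken from the linked project. The names utf8_bytes, encode_utf8 and append_utf8 are hypothetical, and mapping invalid code points to U+FFFD is an assumption made here for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type (not part of Boost or the question's code):
// the UTF-8 code units for one code point, plus how many are used.
struct utf8_bytes
{
    char        data[4];
    std::size_t size;
};

// Encodes one code point as UTF-8; usable in constant expressions (C++14).
// Assumption: surrogates and values above U+10FFFF are replaced by U+FFFD.
constexpr utf8_bytes encode_utf8(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    utf8_bytes r = {{0, 0, 0, 0}, 0};
    if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
        r.data[0] = static_cast<char>(cp);
        r.size = 1;
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        r.data[0] = static_cast<char>(0xC0 | (cp >> 6));
        r.data[1] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 2;
    } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        r.data[0] = static_cast<char>(0xE0 | (cp >> 12));
        r.data[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[2] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 3;
    } else {                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        r.data[0] = static_cast<char>(0xF0 | (cp >> 18));
        r.data[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        r.data[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[3] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 4;
    }
    return r;
}

// Compile-time check: U+03B1 (Greek small alpha) occupies two UTF-8 bytes.
static_assert(encode_utf8(U'\u03B1').size == 2, "alpha encodes to 2 bytes");

// Runtime convenience: append the UTF-8 form of cp to an existing string.
inline void append_utf8(std::string& out, char32_t cp)
{
    const utf8_bytes b = encode_utf8(cp);
    out.append(b.data, b.size);
}
```

With something like this, the symbol table can keep its char32_t values, and a semantic action in the style of push_utf8 could presumably call append_utf8(_val(ctx), _attr(ctx)) instead of going through boost::utf8_output_iterator. Either way, the per-symbol conversion cost at runtime should be small compared with the parse itself.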