c++11, boost, boost-spirit, boost-spirit-qi

Boost Spirit: Sub-grammar appending to string?


I am toying with Boost.Spirit. As part of a larger work I am trying to construct a grammar for parsing C/C++ style string literals. I encountered a problem:

How do I create a sub-grammar that appends a std::string() result to the calling grammar's std::string() attribute (instead of just a char)?

Here is my code, which is working so far. (Actually I already got much more than that, including stuff like '\n' etc., but I cut it down to the essentials.)

#define BOOST_SPIRIT_UNICODE

#include <iostream>
#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

using namespace boost;
using namespace boost::spirit;
using namespace boost::spirit::qi;

template < typename Iterator >
struct EscapedUnicode : grammar< Iterator, char() > // <-- should be std::string
{
    EscapedUnicode() : EscapedUnicode::base_type( escaped_unicode )
    {
        escaped_unicode %= "\\" > ( ( "u" >> uint_parser< char, 16, 4, 4 >() )
                                  | ( "U" >> uint_parser< char, 16, 8, 8 >() ) );
    }

    rule< Iterator, char() > escaped_unicode;  // <-- should be std::string
};

template < typename Iterator >
struct QuotedString : grammar< Iterator, std::string() >
{
    QuotedString() : QuotedString::base_type( quoted_string )
    {
        quoted_string %= '"' >> *( escaped_unicode | ( char_ - ( '"' | eol ) ) ) >> '"';
    }

    EscapedUnicode< Iterator > escaped_unicode;
    rule< Iterator, std::string() > quoted_string;
};

int main()
{
    std::string input = "\"foo\u0041\"";
    typedef std::string::const_iterator iterator_type;
    QuotedString< iterator_type > qs;
    std::string result;
    bool r = parse( input.cbegin(), input.cend(), qs, result );
    std::cout << result << std::endl;
}

This prints fooA -- the QuotedString grammar calls the EscapedUnicode grammar, which results in a char being added to the std::string attribute of QuotedString (the A, 0x41).

But of course I would need to generate a sequence of chars (bytes) for anything beyond 0x7f. EscapedUnicode would need to produce a std::string, which would have to be appended to the string generated by QuotedString.

And that is where I've met a roadblock. I do not understand the things Boost.Spirit does in concert with Boost.Phoenix, and any attempts I have made resulted in lengthy and pretty much undecipherable template-related compiler errors.

So, how can I do this? The answer need not actually do the proper Unicode conversion; it's the std::string issue I need a solution for.


Solution

  • A few points applied:

    • Please avoid blanket using namespace directives in highly generic code. Argument-dependent lookup (ADL) will ruin your day unless you control it.
    • Operator %= is auto-rule assignment, meaning that automatic attribute propagation is forced even in the presence of semantic actions. You don't want that here, because the attribute exposed by uint_parser will not be (correctly) propagated automatically if you want to encode into a multi-byte string representation.
    • The input string

      std::string input = "\"foo\u0041\"";
      

      needed to be

      std::string input = "\"foo\\u0041\"";
      

      otherwise the compiler does the escape handling before the parser even runs :)
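To make the difference between the two literals concrete, here is a minimal sketch using only the standard library. The two function names are made up for illustration; they simply return the two spellings so their contents can be compared.

```cpp
#include <string>

// Hypothetical helper: the literal as written in the question.
// The compiler replaces \u0041 with 'A' during translation,
// so the parser never gets to see a backslash.
inline std::string literal_as_in_question() {
    return "\"foo\u0041\"";
}

// Hypothetical helper: here the backslash itself is escaped, so the string
// really contains the six characters \ u 0 0 4 1 for the parser to handle.
inline std::string literal_with_escaped_backslash() {
    return "\"foo\\u0041\"";
}
```

The first string is six characters long ("fooA" plus the two quotes), the second is eleven, and only the second gives the EscapedUnicode grammar anything to match.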

    Here come the specific tricks for the meat of the task:

    • You will want to change the rule's declared attribute to something that Spirit will automatically "flatten" in simple sequences. E.g.

      quoted_string = '"' >> *(escaped_unicode | (qi::char_ - ('"' | qi::eol))) >> '"';
      

      will not append, because the first branch of the alternative yields a sequence of char and the second a single char. The following equivalent spelling:

      quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
      

      subtly triggers the appending heuristic in Spirit, so we can achieve what we want without involving Semantic Actions.

    The rest is straightforward:

    • implement the actual encoding with a Phoenix function object:

      struct encode_f {
          template <typename...> struct result { using type = void; };
      
          template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
              // TODO implement desired encoding (e.g. UTF8)
              bio::stream<bio::back_insert_device<V> > os(a);
              os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
          }
      };
      boost::phoenix::function<encode_f> encode;
      

      This you can then use like:

      escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                               | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );
      

      Because you mentioned you don't care about the specific encoding, I elected to encode the raw codepoint in 16-bit or 32-bit hex representation, like [0x0041]. I pragmatically used Boost Iostreams, which can write directly into the attribute's container type.

    • Use the BOOST_SPIRIT_DEBUG* macros (BOOST_SPIRIT_DEBUG, BOOST_SPIRIT_DEBUG_NODES) to trace what the rules are doing.
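The placeholder formatting done inside encode_f can be tried out in isolation, without Spirit or Boost Iostreams. The sketch below is a hypothetical stand-in (format_codepoint is not part of the answer's code) that uses std::ostringstream and returns the text instead of writing into the attribute.

```cpp
#include <cstdint>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical stand-in for the body of encode_f: formats a code point as
// "[0x....]" using one hex digit per 4 bits of the integer type CP.
template <typename CP>
std::string format_codepoint(CP codepoint) {
    std::ostringstream os;
    os << "[0x" << std::hex
       << std::setw(std::numeric_limits<CP>::digits / 4)  // 4 digits for 16 bit, 8 for 32 bit
       << std::setfill('0')
       << codepoint << "]";
    return os.str();
}
```

Writing the "0x" prefix by hand rather than via std::showbase keeps the prefix out of the field width, so the zero padding applies to the digits only.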

    Live On Coliru

    //#define BOOST_SPIRIT_UNICODE
    //#define BOOST_SPIRIT_DEBUG
    
    #include <cstdint>
    #include <iostream>
    #include <limits>
    #include <string>
    #include <vector>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    
    // for demo re-encoding
    #include <boost/iostreams/device/back_inserter.hpp>
    #include <boost/iostreams/stream.hpp>
    #include <iomanip>
    
    namespace qi  = boost::spirit::qi;
    namespace bio = boost::iostreams;
    namespace phx = boost::phoenix;
    
    template <typename Iterator, typename Attr = std::vector<char> > // or std::string for that matter
    struct EscapedUnicode : qi::grammar<Iterator, Attr()>
    {
        EscapedUnicode() : EscapedUnicode::base_type(escaped_unicode)
        {
            using namespace qi;
    
            escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                     | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );
    
            BOOST_SPIRIT_DEBUG_NODES((escaped_unicode))
        }
    
        struct encode_f {
            template <typename...> struct result { using type = void; };
    
            template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
                // TODO implement desired encoding (e.g. UTF8)
                bio::stream<bio::back_insert_device<V> > os(a);
                os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
            }
        };
        boost::phoenix::function<encode_f> encode;
    
        qi::rule<Iterator, Attr()> escaped_unicode;
    };
    
    template <typename Iterator>
    struct QuotedString : qi::grammar<Iterator, std::string()>
    {
        QuotedString() : QuotedString::base_type(start)
        {
            start = quoted_string;
            quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
            BOOST_SPIRIT_DEBUG_NODES((start)(quoted_string))
        }
    
        EscapedUnicode<Iterator> escaped_unicode;
        qi::rule<Iterator, std::string()> start;
        qi::rule<Iterator, std::vector<char>()> quoted_string;
    };
    
    int main() {
        std::string input = "\"foo\\u0041\\U00000041\"";
    
        typedef std::string::const_iterator iterator_type;
        QuotedString<iterator_type> qs;
        std::string result;
        bool r = parse( input.cbegin(), input.cend(), qs, result );
        std::cout << std::boolalpha << r << ": '" << result << "'\n";
    }
    

    Prints:

    true: 'foo[0x0041][0x00000041]'
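If actual UTF-8 output is wanted instead of the bracketed placeholder, the TODO in encode_f could be filled in along the following lines. This is a sketch under assumptions, not part of the code above: append_utf8 is a hypothetical helper name, and rejecting surrogates and values above U+10FFFF is one possible policy among several.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of `cp` to `out`, which can
// be any container with push_back(char), e.g. std::string or std::vector<char>.
// Returns false and appends nothing for surrogates and values above U+10FFFF.
template <typename Container>
bool append_utf8(Container& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}
```

Inside encode_f::operator() the stream-based body would then presumably reduce to a call such as append_utf8(a, codepoint). How to react to a false return (fail the parse, substitute U+FFFD, and so on) is left open here.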