c++11 boost boost-spirit boost-spirit-qi

Boost Spirit: Sub-grammar appending to string?

I am toying with Boost.Spirit. As part of a larger work I am trying to construct a grammar for parsing C/C++ style string literals. I encountered a problem:

How do I create a sub-grammar that appends a std::string() result to the calling grammar's std::string() attribute (instead of just a char?

Here is my code, which is working so far. (Actually I already got much more than that, including stuff like '\n' etc., but I cut it down to the essentials.)

#define BOOST_SPIRIT_UNICODE

#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

using namespace boost;
using namespace boost::spirit;
using namespace boost::spirit::qi;

template < typename Iterator >
struct EscapedUnicode : grammar< Iterator, char() > // <-- should be std::string
{
    EscapedUnicode() : EscapedUnicode::base_type( escaped_unicode )
    {
        escaped_unicode %= "\\" > ( ( "u" >> uint_parser< char, 16, 4, 4 >() )
                                  | ( "U" >> uint_parser< char, 16, 8, 8 >() ) );
    }

    rule< Iterator, char() > escaped_unicode;  // <-- should be std::string
};

template < typename Iterator >
struct QuotedString : grammar< Iterator, std::string() >
{
    QuotedString() : QuotedString::base_type( quoted_string )
    {
        quoted_string %= '"' >> *( escaped_unicode | ( char_ - ( '"' | eol ) ) ) >> '"';
    }

    EscapedUnicode< Iterator > escaped_unicode;
    rule< Iterator, std::string() > quoted_string;
};

int main()
{
    std::string input = "\"foo\u0041\"";
    typedef std::string::const_iterator iterator_type;
    QuotedString< iterator_type > qs;
    std::string result;
    bool r = parse( input.cbegin(), input.cend(), qs, result );
    std::cout << result << std::endl;
}

This prints fooA -- the QuotedString grammar calls the EscapedUnicode grammar, which results in a char being added to the std::string attribute of QuotedString (the A, 0x41).

But of course I would need to generate a sequence of chars (bytes) for anything beyond 0x7f. EscapedUnicode would neet to produce a std::string, which would have to be appended to the string generated by QuotedString.

And that is where I've met a roadblock. I do not understand the things Boost.Spirit does in concert with Boost.Phoenix, and any attempts I have made resulted in lengthy and pretty much undecipherable template-related compiler errors.

So, how can I do this? The answer need not actually do the proper Unicode conversion; it's the std::string issue I need a solution for.

Solution

A few points applied:

please do not blanket using namespace in relation to highly generic code. ADL will ruin your day unless you control it
Operator %= is auto-rule assignment, meaning that automatic attribute propagation will be forced even in the presence of semantic actions. You don't want that because the attribute exposed by uint_parser will not be (correctly) automatically propagated if you want to encode into multi-byte string representation.
The input string
```
std::string input = "\"foo\u0041\"";
```
needed to be
```
std::string input = "\"foo\\u0041\"";
```
otherwise the compiler did the escape handling before the parser even runs :)

Here come the specific tricks for the meat of the task:

You will want to change the rule's declared attribute to something that Spirit will automatically "flatten" in simple sequences. E.g.
```
quoted_string = '"' >> *(escaped_unicode | (qi::char_ - ('"' | qi::eol))) >> '"';
```
Will not append because the first branch of the alternate results in a sequence of char, and the second in a single char. The following spelling of the equivalent:
```
quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
```
subtly triggers the appending heuristic in Spirit, so we can achieve what we want without involving Semantic Actions.

The rest is straight-forward:

implement the actual encoding with a Phoenix function object:

struct encode_f {
    template <typename...> struct result { using type = void; };

    template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
        // TODO implement desired encoding (e.g. UTF8)
        bio::stream<bio::back_insert_device<V> > os(a);
        os << "[" << std::hex << std::showbase << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
    }
};
boost::phoenix::function<encode_f> encode;

This you can then use like:

escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                         | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

Because you mentioned you don't care about the specific encoding, I elected to encode the raw codepoint in 16bit or 32bit hex representation like [0x0041]. I pragmatically used Boost Iostreams which is capable of directly writing into the attribute's container type

Use BOOST_SPIRIT_DEBUG* macros

Live On Coliru

//#define BOOST_SPIRIT_UNICODE
//#define BOOST_SPIRIT_DEBUG

#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

// for demo re-encoding
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream.hpp>
#include <iomanip>

namespace qi  = boost::spirit::qi;
namespace bio = boost::iostreams;
namespace phx = boost::phoenix;

template <typename Iterator, typename Attr = std::vector<char> > // or std::string for that matter
struct EscapedUnicode : qi::grammar<Iterator, Attr()>
{
    EscapedUnicode() : EscapedUnicode::base_type(escaped_unicode)
    {
        using namespace qi;

        escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                 | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

        BOOST_SPIRIT_DEBUG_NODES((escaped_unicode))
    }

    struct encode_f {
        template <typename...> struct result { using type = void; };

        template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
            // TODO implement desired encoding (e.g. UTF8)
            bio::stream<bio::back_insert_device<V> > os(a);
            os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
        }
    };
    boost::phoenix::function<encode_f> encode;

    qi::rule<Iterator, Attr()> escaped_unicode;
};

template <typename Iterator>
struct QuotedString : qi::grammar<Iterator, std::string()>
{
    QuotedString() : QuotedString::base_type(start)
    {
        start = quoted_string;
        quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
        BOOST_SPIRIT_DEBUG_NODES((start)(quoted_string))
    }

    EscapedUnicode<Iterator> escaped_unicode;
    qi::rule<Iterator, std::string()> start;
    qi::rule<Iterator, std::vector<char>()> quoted_string;
};

int main() {
    std::string input = "\"foo\\u0041\\U00000041\"";

    typedef std::string::const_iterator iterator_type;
    QuotedString<iterator_type> qs;
    std::string result;
    bool r = parse( input.cbegin(), input.cend(), qs, result );
    std::cout << std::boolalpha << r << ": '" << result << "'\n";
}

Prints:

true: 'foo[0x0041][0x00000041]'