Tags: c++, boost, boost-spirit, boost-spirit-qi

Decode HTTP header value fully with Boost.Spirit


Once again, I find myself reaching for boost spirit. Once again I find myself defeated by it.

An HTTP header value takes the general form:

text/html; q=1.0, text/*; q=0.8, image/gif; q=0.6, image/jpeg; q=0.6, image/*; q=0.5, */*; q=0.1

i.e. value *OWS [; *OWS name *OWS [= *OWS possibly_quoted_value] *OWS [...]] *OWS [ , <another value> ...]

so in my mind, this header decodes to:

value[0]: 
  text/html
  params:
    name : q
    value : 1.0
value[1]:
  text/*
  params:
    name : q
    value : 0.8
...

and so on.
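In C++ terms, the target could be modeled roughly as below. This is only a sketch: the names `header_param`, `header_value` and `describe` are hypothetical and chosen for illustration, and `describe` simply renders the values in the layout shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical AST for one decoded header; all names are illustrative only.
struct header_param {
    std::string name;
    std::string value;  // left empty when the parameter has no "=value" part
};

struct header_value {
    std::string text;                  // e.g. "text/html"
    std::vector<header_param> params;  // e.g. { {"q", "1.0"} }
};

// Render the decoded values in the layout used in the listing above.
std::string describe(const std::vector<header_value>& values) {
    std::ostringstream os;
    for (std::size_t i = 0; i < values.size(); ++i) {
        os << "value[" << i << "]:\n  " << values[i].text << "\n  params:\n";
        for (const auto& p : values[i].params)
            os << "    name : " << p.name << "\n    value : " << p.value << "\n";
    }
    return os.str();
}
```

Any parser, hand-written or Spirit-based, would then only have to fill a `std::vector<header_value>`.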

I am certain that to anyone who knows how, the boost::spirit::qi syntax for this is trivial.

I humbly ask for your assistance.

For example, here's the outline of the code that decodes the Content-Type header. That header is limited to a single value of the form type/subtype, followed by any number of parameters of the form <sp> ; <sp> token=token|quoted_string:

template<class Iter>
void parse(ContentType& ct, Iter first, Iter last)
{
    ct.mutable_type()->append(to_lower(consume_token(first, last)));
    consume_lit(first, last, '/');
    ct.mutable_subtype()->append(to_lower(consume_token(first, last)));
    while (first != last) {
        skipwhite(first, last);
        if (consume_char_if(first, last, ';'))
        {
            auto p = ct.add_parameters();
            skipwhite(first, last);
            p->set_name(to_lower(consume_token(first, last)));
            skipwhite(first, last);
            if (consume_char_if(first, last, '='))
            {
                skipwhite(first, last);
                p->set_value(consume_token_or_quoted(first, last));
            }
            else {
                // no value on this parameter
            }
        }
        else if (consume_char_if(first, last, ','))
        {
            // normally we should get the next value-token here but in the case of Content-Type
            // we must barf
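            throw std::runtime_error("invalid use of , in Content-Type");
        }
        else if (first != last)
        {
            // anything else would leave `first` unchanged and loop forever
            throw std::runtime_error("unexpected character in Content-Type");
        }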
    }
}

ContentType& populate(ContentType& ct, const std::string& header_value)
{
    parse(ct, header_value.begin(), header_value.end());
    return ct;
}
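The helper functions called in the outline are not shown. A minimal sketch of what they might look like follows; the signatures are assumptions inferred from the call sites, and the set of separator characters is the usual HTTP token definition rather than anything taken from the original code.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Assumed token alphabet: visible ASCII minus the HTTP separator characters.
inline bool is_token_char(char c) {
    static const std::string separators = "()<>@,;:\\\"/[]?={} \t";
    const unsigned char u = static_cast<unsigned char>(c);
    return u > 31 && u < 127 && separators.find(c) == std::string::npos;
}

inline std::string to_lower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Advance past spaces and horizontal tabs.
template <class Iter>
void skipwhite(Iter& first, Iter last) {
    while (first != last && (*first == ' ' || *first == '\t')) ++first;
}

// Consume `expected` if it is the next character; report whether it was.
template <class Iter>
bool consume_char_if(Iter& first, Iter last, char expected) {
    if (first != last && *first == expected) { ++first; return true; }
    return false;
}

// Like consume_char_if, but a missing character is an error.
template <class Iter>
void consume_lit(Iter& first, Iter last, char expected) {
    if (!consume_char_if(first, last, expected))
        throw std::runtime_error(std::string("expected '") + expected + "'");
}

// Consume one or more token characters.
template <class Iter>
std::string consume_token(Iter& first, Iter last) {
    std::string out;
    while (first != last && is_token_char(*first)) out += *first++;
    if (out.empty()) throw std::runtime_error("expected token");
    return out;
}

// Consume either a token or a "quoted string" with backslash escapes.
template <class Iter>
std::string consume_token_or_quoted(Iter& first, Iter last) {
    if (!consume_char_if(first, last, '"')) return consume_token(first, last);
    std::string out;
    while (first != last && *first != '"') {
        if (*first == '\\') {              // quoted-pair: take the next char literally
            ++first;
            if (first == last) break;
        }
        out += *first++;
    }
    consume_lit(first, last, '"');         // throws if the closing quote is missing
    return out;
}
```

All helpers take `first` by reference so that each call advances the shared read position, which is what the outline relies on.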

Solution

  • I've taken the code as posted by OP and given it a review.

    1. There's no need to specify void(). In fact it's preferable to use qi::unused_type in such cases, which is what rules will default to if no attribute type is declared.

    2. There is no need for char_ if you don't wish to expose the attribute. Use lit instead.

    3. There is no need to wrap every char parser in a rule. That hurts performance. It's best to leave the proto expression tree unevaluated for as long as possible, so Qi can optimize parser expressions more and the compiler can inline more.

      Also, Qi doesn't have move semantics on attributes, so avoiding redundant rules eliminates redundant copies of sub-attributes that get concatenated in the containing rules.

      Sample alternative spelling (caution, see Assigning parsers to auto variables)

      auto CR   = qi::lit('\r');
      auto LF   = qi::lit('\n');
      auto CRLF = qi::lit("\r\n");
      auto HT   = qi::lit('\t');
      auto SP   = qi::lit(' ');
      auto LWS  = qi::copy(-CRLF >> +(SP | HT)); // deepcopy
      
      UPALPHA = char_('A', 'Z');
      LOALPHA = char_('a', 'z');
      ALPHA   = UPALPHA | LOALPHA;
      DIGIT   = char_('0', '9');
      //CTL     = char_(0, 31) | char_(127);
      TEXT    = char_("\t\x20-\x7e\x80-\xff");
      
    4. Since you didn't have to use char_, you also don't have to kill the attribute using qi::omit[].

    5. When you are in a Qi domain expression template, raw string/char literals are implicitly wrapped in a qi::lit, so you can simplify things like

      quoted_pair   = omit[char_('\\')] >> char_;
      quoted_string = omit[char_('"')] >> *(qdtext | quoted_pair) >> omit[char_('"')];
      

      to just

      quoted_pair   = '\\' >> char_;
      quoted_string = '"' >> *(qdtext | quoted_pair) >> '"';
      
    6. instead of spelling out skipping spaces with omit[*SP] all the time, just declare the rule with a skipper. Now, you can simplify

      nvp               = token >> omit[*SP] >> omit['='] >> omit[*SP] >> value;
      any_parameter     = omit[*SP] >> omit[char_(';')] >> omit[*SP] >> (nvp | name_only);
      content_type_rule = type_subtype_rule >> *any_parameter;
      

      to just

      nvp               = token >> '=' >> value;
      any_parameter     = ';' >> (nvp | name_only);
      content_type_rule = type_subtype_rule >> qi::skip(spaces)[*any_parameter];
      

      Note that any subrule invocations of rules that are declared without a skipper are implicitly lexeme: Boost spirit skipper issues

    7. there were many redundant/unused headers

    8. recent compilers + boost versions make BOOST_FUSION_ADAPT_STRUCT much simpler by using decltype

    The results of simplifying are much less noisy:

    //#define BOOST_SPIRIT_DEBUG
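    #include <boost/spirit/include/qi.hpp>
    #include <boost/fusion/include/adapted.hpp>
    #include <boost/optional.hpp>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;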
    
    struct parameter {
        boost::optional<std::string> name;
        std::string value;
    };
    
    struct type_subtype {
        std::string type;
        std::string subtype;
    };
    
    struct content_type {
        type_subtype type;
        std::vector<parameter> params;
    };
    
    BOOST_FUSION_ADAPT_STRUCT(type_subtype, type, subtype)
    BOOST_FUSION_ADAPT_STRUCT(content_type, type, params)
    
    template<class Iterator>
    struct token_grammar : qi::grammar<Iterator, content_type()>
    {
        token_grammar() : token_grammar::base_type(content_type_rule)
        {
            using qi::ascii::char_;
    
            spaces        = char_(' ');
            token         = +~char_( "()<>@,;:\\\"/[]?={} \t");
            quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
            value         = quoted_string | token;
    
            type_subtype_rule = token >> '/' >> token;
            name_only         = token;
            nvp               = token >> '=' >> value;
            any_parameter     = ';' >> (nvp | name_only);
            content_type_rule = type_subtype_rule >> qi::skip(spaces) [*any_parameter];
    
            BOOST_SPIRIT_DEBUG_NODES((nvp)(any_parameter)(content_type_rule)(quoted_string)(token)(value)(type_subtype_rule))
        }
    
      private:
        using Skipper = qi::space_type;
        Skipper spaces;
    
        // NOTE: binary_parameter and unary_parameter are not defined in this listing
        qi::rule<Iterator, binary_parameter(), Skipper> nvp;
        qi::rule<Iterator, parameter(), Skipper>        any_parameter;
        qi::rule<Iterator, content_type()>              content_type_rule;
    
        // lexemes
        qi::rule<Iterator, std::string()>               quoted_string, token, value;
        qi::rule<Iterator, type_subtype()>              type_subtype_rule;
        qi::rule<Iterator, unary_parameter()>           name_only;
    };
    

    See it Live On Coliru (with the same test cases)

    BONUS

    I'd prefer a simpler AST in a case like this. By injecting some attribute values using qi::attr you can avoid using boost::variant and/or even avoid boost::optional:

    struct parameter {
        bool have_name;
        std::string name;
        std::string value;
    };
    
    struct type_subtype {
        std::string type;
        std::string subtype;
    };
    
    struct content_type {
        type_subtype type;
        std::vector<parameter> params;
    };
    
    BOOST_FUSION_ADAPT_STRUCT(parameter, have_name, name, value)
    BOOST_FUSION_ADAPT_STRUCT(type_subtype, type, subtype)
    BOOST_FUSION_ADAPT_STRUCT(content_type, type, params)
    
    namespace qi = boost::spirit::qi;
    
    template<class Iterator>
    struct token_grammar : qi::grammar<Iterator, content_type()>
    {
        token_grammar() : token_grammar::base_type(content_type_rule)
        {
            using qi::ascii::char_;
    
            spaces        = char_(' ');
            token         = +~char_( "()<>@,;:\\\"/[]?={} \t");
            quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
            value         = quoted_string | token;
    
            type_subtype_rule = token >> '/' >> token;
            name_only         = qi::attr(false) >> qi::attr("") >> token;
            nvp               = qi::attr(true)  >> token >> '=' >> value;
            any_parameter     = ';' >> (nvp | name_only);
            content_type_rule = type_subtype_rule >> qi::skip(spaces) [*any_parameter];
    
            BOOST_SPIRIT_DEBUG_NODES((nvp)(any_parameter)(content_type_rule)(quoted_string)(token)(value)(type_subtype_rule))
        }
    
      private:
        using Skipper = qi::space_type;
        Skipper spaces;
    
        qi::rule<Iterator, parameter(), Skipper> nvp, name_only, any_parameter;
        qi::rule<Iterator, content_type()>       content_type_rule;
    
        // lexemes
        qi::rule<Iterator, std::string()>        quoted_string, token, value;
        qi::rule<Iterator, type_subtype()>       type_subtype_rule;
    };