Tags: c++, boost-spirit, boost-spirit-qi

Boost.Spirit, how to extend xml parsing?


I would like to extend the XML parser from the Boost.Spirit examples so that it also parses XML attributes.

Here is the example from the library, with some modifications of mine:

template <typename Iterator>
struct mini_xml_grammar
: qi::grammar<Iterator, mini_xml(), qi::locals<std::string>, ascii::space_type>
{
    mini_xml_grammar()
    : mini_xml_grammar::base_type(xml, "xml")
    {
        using qi::lit;
        using qi::lexeme;
        using qi::attr;
        using qi::on_error;
        using qi::fail;
        using ascii::char_;
        using ascii::string;
        using ascii::alnum;
        using ascii::space;

        using namespace qi::labels;

        using phoenix::construct;
        using phoenix::val;


        text %= lexeme[+(char_ - '<')];
        node %= xml | text;


        start_tag %=
        '<'
        >>  !lit('/')
        >   lexeme[+(char_ - '>')]
        >   '>'
        ;

        end_tag =
        "</"
        >   string(_r1)
        >   '>'
        ;

        xml %=
        start_tag[_a = _1]
        >   *node
        >   end_tag(_a)
        ;

        xml.name("xml");
        node.name("node");
        text.name("text");
        start_tag.name("start_tag");
        end_tag.name("end_tag");

        on_error<fail>
        (
         xml
         , std::cout
         << val("Error! Expecting ")
         << _4                               // what failed?
         << val(" here: \"")
         << construct<std::string>(_3, _2)   // iterators to error-pos, end
         << val("\"")
         << std::endl
         );
    }

    qi::rule<Iterator, mini_xml(), qi::locals<std::string>, ascii::space_type> xml;
    qi::rule<Iterator, mini_xml_node(), ascii::space_type> node;
    qi::rule<Iterator, std::string(), ascii::space_type> text;
    qi::rule<Iterator, std::string(), ascii::space_type> attribute;
    qi::rule<Iterator, std::string(), ascii::space_type> start_tag;
    qi::rule<Iterator, void(std::string), ascii::space_type> end_tag;
};

I tried this, but it fails to compile with the error "use of undeclared identifier 'eps'":

        xml %= 
        start_tag[_a = _1] 
        > attribute 
        > (  "/>" > eps
            |  ">" > *node > end_tag(_a) 
            )
        ;

Does anyone know how to add the ability to parse XML attributes?


Solution

  • The eps identifier, like many of the other identifiers you use, is defined in the qi namespace. The others are made visible inside your constructor by the using declarations at its top. Do the same for eps:

    using qi::eps;
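
To see why the missing declaration matters, here is a minimal sketch that does not use Spirit at all. The namespace `demo`, its functions `lit` and `eps`, and the type `grammar_like` are hypothetical stand-ins for `qi` and the grammar. Only names pulled in by a using declaration can be called unqualified inside the constructor.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the qi namespace; not part of Boost.
namespace demo {
    inline std::string lit() { return "lit"; }
    inline std::string eps() { return "eps"; }
}

struct grammar_like {
    std::string used;

    grammar_like() {
        // Each using declaration makes one name visible in this
        // constructor's scope only, not in the global namespace.
        using demo::lit;
        using demo::eps;   // remove this line and eps() below no longer compiles

        used = lit() + "," + eps();
    }
};

// Usage sketch: both unqualified calls resolved to the demo:: functions.
inline void using_demo() {
    assert(grammar_like().used == "lit,eps");
}
```

Writing `demo::eps()` in full would work just as well; the using declaration only saves the qualification.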
    

    Once you resolve that, you have the larger issue of whether you're correctly representing the syntax and grammar of XML. It doesn't look like you have it right. You have this:

    xml %= 
          start_tag[_a = _1]
        > attribute
        > (   "/>" > eps
            | ">" > *node > end_tag(_a)
          )
        ;
    

    That can't be right, though. Attributes are part of a tag, not things that follow a tag. It looks like you wanted to break start_tag apart so you could handle empty tags. If I were doing this, I'd probably create an empty_tag rule instead, and then change xml to be empty_tag | (start_tag > *node > end_tag). That's how the W3C language recommendation does it:

    [39]  element   ::= EmptyElemTag
                        | STag content ETag
    

    But don't worry about that for now. Remember that your stated task is to add attributes to the parser. Don't get distracted by other missing features. There are plenty of those to work on later.

    I mentioned the W3C document. You should refer to that often; it defines the language, and it even shows the grammar. One of the design goals of Spirit was that it should look like a grammar definition. Use that to your advantage by trying to mimic the W3C grammar in your own code. The W3C defines the start tag like this:

    [40]  STag      ::= '<' Name (S Attribute)* S? '>'
    [41]  Attribute ::= Name Eq AttValue    
    

    So write your code like this:

    start_tag %=
        // Can't use operator> for "expect" because empty_tag
        // will be the same up to the final line.
           '<'
        >> !lit('/')
        >> name
        >> *attribute
        >> '>'
        ;
    
    name %= ...; // see below
    
    attribute %=
          name
        > '='
        > attribute_value
        ;
    

    The spec defines the attribute-value syntax:

    [10]  AttValue  ::= '"' ([^<&"] | Reference)* '"'
                        |  "'" ([^<&'] | Reference)* "'"
    

    I wouldn't worry about entity references yet. Like empty tags, your current code already doesn't support them, so it's not important to add them now as part of attributes. That makes attribute_value easy to define:

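    attribute_value %=
        // lexeme[] keeps the space skipper from dropping
        // whitespace inside the quoted value.
          lexeme['"' > *(char_ - char_("<&\"")) > '"']
        | lexeme['\'' > *(char_ - char_("<&'")) > '\'']
        ;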
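
As a reference to test the rule against, the quoted-value syntax (again without entity references) can be sketched by hand in plain C++. This is not Spirit code, and `parse_att_value` is a hypothetical helper name. It keeps whitespace inside the quotes verbatim; a Spirit rule running under the `ascii::space` skipper needs `lexeme[]` around the quoted part to behave the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hand-written sketch of
//   AttValue ::= '"' [^<&"]* '"' | "'" [^<&']* "'"
// On success: stores the unquoted value in `out`, moves `pos` past the
// closing quote, and returns true. On failure nothing is modified.
bool parse_att_value(const std::string& in, std::size_t& pos, std::string& out)
{
    if (pos >= in.size() || (in[pos] != '"' && in[pos] != '\''))
        return false;                 // must start with a quote character

    const char quote = in[pos];
    std::size_t i = pos + 1;
    std::string value;

    while (i < in.size() && in[i] != quote) {
        if (in[i] == '<' || in[i] == '&')
            return false;             // not allowed inside a value
        value += in[i++];             // whitespace is kept as-is
    }
    if (i == in.size())
        return false;                 // closing quote is missing

    out = value;
    pos = i + 1;
    return true;
}

// Usage sketch.
inline void att_value_demo() {
    std::size_t pos = 0;
    std::string value;
    assert(parse_att_value("\"a b\" rest", pos, value));
    assert(value == "a b");           // the inner space survives
}
```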
    

    The name definition doesn't have to be anything fancy yet. It's complicated in the specification because it handles the full Unicode range of characters, but you can start with something simpler and come back to it later, when you figure out how to handle Unicode characters throughout your parser.

    name %=
        lexeme[char_("a-zA-Z:_") >> *char_("-a-zA-Z0-9:_")]
        ;
    

    These changes should allow you to parse XML attributes. However, it's another matter to extract the results as Spirit attributes (so you can know the names and values of attributes for a given tag in the rest of your program), and I'm not prepared to discuss that right now.
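
One possible shape for the extracted result, sketched without Spirit so that it stays self-contained: a tag name plus a list of name/value pairs. The names `start_tag_data` and `parse_start_tag` are hypothetical, and the hand-written parser only mirrors the rules above (ASCII names, no entity references, no empty tags). It does not show how to wire such a structure into Spirit's attribute propagation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: what a start tag could hand to the caller.
struct start_tag_data {
    std::string name;
    std::vector<std::pair<std::string, std::string> > attributes;
};

namespace detail {

inline void skip_space(const std::string& in, std::size_t& pos) {
    while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
}

// name: [a-zA-Z:_][-a-zA-Z0-9:_]*  (same ASCII subset as the name rule)
inline bool parse_name(const std::string& in, std::size_t& pos, std::string& out) {
    auto is_start = [](unsigned char c) {
        return std::isalpha(c) || c == ':' || c == '_';
    };
    std::size_t i = pos;
    if (i >= in.size() || !is_start(in[i])) return false;
    ++i;
    while (i < in.size() &&
           (is_start(in[i]) ||
            std::isdigit(static_cast<unsigned char>(in[i])) ||
            in[i] == '-'))
        ++i;
    out = in.substr(pos, i - pos);
    pos = i;
    return true;
}

// attribute_value: '"' [^<&"]* '"' | "'" [^<&']* "'"
inline bool parse_value(const std::string& in, std::size_t& pos, std::string& out) {
    if (pos >= in.size() || (in[pos] != '"' && in[pos] != '\'')) return false;
    const char quote = in[pos];
    const std::size_t end = in.find(quote, pos + 1);
    if (end == std::string::npos) return false;
    const std::string value = in.substr(pos + 1, end - pos - 1);
    if (value.find_first_of("<&") != std::string::npos) return false;
    out = value;
    pos = end + 1;
    return true;
}

} // namespace detail

// start_tag: '<' name attribute* '>'
// Whitespace between the parts is optional here, much as it is under a
// Spirit skipper. `out` and `pos` change only if the whole tag parses.
bool parse_start_tag(const std::string& in, std::size_t& pos, start_tag_data& out) {
    std::size_t i = pos;
    start_tag_data tag;
    if (i >= in.size() || in[i] != '<') return false;
    ++i;
    if (!detail::parse_name(in, i, tag.name)) return false;   // rejects "</..."
    for (;;) {
        detail::skip_space(in, i);
        if (i < in.size() && in[i] == '>') { ++i; break; }
        std::string key, value;
        if (!detail::parse_name(in, i, key)) return false;
        detail::skip_space(in, i);
        if (i >= in.size() || in[i] != '=') return false;
        ++i;
        detail::skip_space(in, i);
        if (!detail::parse_value(in, i, value)) return false;
        tag.attributes.push_back(std::make_pair(key, value));
    }
    out = tag;
    pos = i;
    return true;
}

// Usage sketch.
inline void start_tag_demo() {
    std::size_t pos = 0;
    start_tag_data tag;
    assert(parse_start_tag("<img src=\"a b.png\" alt='x'>", pos, tag));
    assert(tag.name == "img" && tag.attributes.size() == 2);
}
```

A vector of pairs keeps the attributes in document order; a `std::map` would be an alternative if lookup by name matters more than order.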