Search code examples
boostboost-spirithostnameboost-spirit-qidomain-name

Why is this Boost Spirit grammar double-outputting part of the match


I'm debugging an existing Boost QI grammar in our software for parsing "endpoints" (host:port, where host can be a hostname, IPv4 address, or IPv6 address). I'm specifically having a problem with the hostname part (the port, IPv4, and IPv6 parts all work perfectly). The full code is extensive, so I've simplified it down to the following example, which can be easily run on Wandbox (C++17, Boost 1.79.0)

#include <iostream>
#include <string>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/fusion/include/boost_array.hpp>
#include <boost/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/algorithm/string.hpp>

class Endpoint
{
public:
   Endpoint(): _host(""), _port(0) {};
   std::string _host;
   unsigned short _port;
};

template < typename ITERATOR >
struct HostnameGrammar :
 boost::spirit::qi::grammar< ITERATOR, std::string() >
{
 HostnameGrammar()
 :
   HostnameGrammar::base_type( start )
 {
   using boost::spirit::qi::alnum;
   using boost::spirit::qi::alpha;
   using boost::spirit::qi::char_;
   using boost::spirit::_1;

   start %=
     dottedName | singleName
     ;

   dottedName %=
     +( singleName >> char_( '.' ) ) >>
     tld
     ;

   singleName %=
     +alnum >> *( char_('-') >> +alnum )
     ;

   tld %=
     +alpha >> *( -char_('-') >> +alnum )
     ;
 }

 boost::spirit::qi::rule< ITERATOR, std::string() > start;
 boost::spirit::qi::rule< ITERATOR, std::string() > dottedName;
 boost::spirit::qi::rule< ITERATOR, std::string() > singleName;
 boost::spirit::qi::rule< ITERATOR, std::string() > tld;
};

template < typename ITERATOR >
struct EndpointGrammar :
 boost::spirit::qi::grammar< ITERATOR, std::string() >
{
 EndpointGrammar(
   Endpoint & endpoint )
 :
   EndpointGrammar::base_type( start ),
   endpoint_( endpoint )
 {
   using boost::spirit::qi::ushort_;
   using boost::spirit::_1;

   start =
     -address[ boost::phoenix::ref( endpoint._host ) = _1 ] >>
     -( ':' >> ushort_[ boost::phoenix::ref( endpoint._port ) = _1 ] )
   ;

   address %=
     hostname;
 }

 Endpoint & endpoint_;

 boost::spirit::qi::rule< ITERATOR, std::string() > start;
 boost::spirit::qi::rule< ITERATOR, std::string() > address;

 HostnameGrammar< ITERATOR > hostname;
};

int main()
{
 std::vector< std::string > endpointStrings {
   // these should parse successfully (and do)
   "0foo", "foo", "foo.net", "foo:1234", "foo.net:5678", "foo.example.net", "foo.example.net:9012",
   "foo-bar", "foo-bar.com", "foo-bar:1234", "foo-bar.net-0:5678", "foo-bar.example.com-1:9012",

   // these should fail to parse (and do)
   "foo.0bar", "foo.0bar:1234", "foo.bar-", "foo.bar-:1234", "foo-", "-foo"
 };

 for ( auto const & endpointString : endpointStrings )
 {
   Endpoint tempEndpoint;
   std::string::const_iterator beginIt( endpointString.begin() );
   std::string::const_iterator endIt( endpointString.end() );
   EndpointGrammar< std::string::const_iterator > grammar( tempEndpoint );

   if ( !boost::spirit::qi::parse( beginIt, endIt, grammar ) || beginIt != endIt )
   {
     std::cout << "Failed: " << endpointString << std::endl;
   }
   else
   {
     std::cout << "Succeeded: " << endpointString << " = " << tempEndpoint._host << " / " << tempEndpoint._port << std::endl;
   }
 }

 return 0;
}

The grammar successfully parses or fails to parse all the examples that it should parse or fail to parse per RFC rules, so that great. The problem is that whatever part is the "last" part of the hostname gets "doubled." Here's the output of running that:

Succeeded: 0foo = 0foo0foo / 0
Succeeded: foo = foofoo / 0
Succeeded: foo.net = foo.netnet / 0
Succeeded: foo:1234 = foofoo / 1234
Succeeded: foo.net:5678 = foo.netnet / 5678
Succeeded: foo.example.net = foo.example.netnet / 0
Succeeded: foo.example.net:9012 = foo.example.netnet / 9012
Succeeded: foo-bar = foo-barfoo-bar / 0
Succeeded: foo-bar.com = foo-bar.comcom / 0
Succeeded: foo-bar:1234 = foo-barfoo-bar / 1234
Succeeded: foo-bar.net-0:5678 = foo-bar.net-0net-0 / 5678
Succeeded: foo-bar.example.com-1:9012 = foo-bar.example.com-1com-1 / 9012
Failed: foo.0bar
Failed: foo.0bar:1234
Failed: foo.bar-
Failed: foo.bar-:1234
Failed: foo-
Failed: -foo

Note "foofoo", "foo.netnet", "foo.example.netnet", etc.

I've tried about 300 different variations of these rules—far too many to include them all here. Suffice it to say that many variations successfully identify valid vs invalid hostnames, but all those that do so correctly also suffer from the same duplicated last-part problem. I'm all out of ideas over here. Anyone know why this doesn't work and how to fix it?

Note that I've even tried the simplest possible alphabetic-only rule to achieve what I'm trying to achieve in order to expand on that rule with my more-complicated character sets, but even this duplicates "foo.net" into "foo.netnet":

start %=
  +( +alpha >> char_( '.' ) ) >> +alpha
  ;

Solution

  • Without even looking, I know you've run into non-atomic attribute propagation into container attributes, see e.g.

    The classical fix would be to apply qi::hold[] - keeping in mind the performance cost under backtracking.

    Proof & "How To Debug"?

    Slapping on some BOOST_SPIRIT_DEBUG* macros, let's zoom in on the foo.netnet case:

    tld        = +qi::alpha >> *(-qi::char_('-') >> +qi::alnum);
    singleName = +qi::alnum >> *(qi::char_('-') >> +qi::alnum);
    dottedName = +(singleName >> qi::char_('.')) >> tld;
    
    hostName   = dottedName | singleName;
    
    BOOST_SPIRIT_DEBUG_NODES((hostName)(dottedName)(singleName)(tld))
    

    Prints (excerpt, Live On Coliru)

      <hostName>
        <try>foo.net</try>
        <dottedName>
          <try>foo.net</try>
          <singleName>
            <try>foo.net</try>
            <success>.net</success>
            <attributes>[[f, o, o]]</attributes>
          </singleName>
          <singleName>
            <try>net</try>
            <success></success>
            <attributes>[[f, o, o, ., n, e, t]]</attributes>
          </singleName>
          <tld>
            <try>net</try>
            <success></success>
            <attributes>[[f, o, o, ., n, e, t, n, e, t]]</attributes>
          </tld>
          <success></success>
          <attributes>[[f, o, o, ., n, e, t, n, e, t]]</attributes>
        </dottedName>
        <success></success>
        <attributes>[[f, o, o, ., n, e, t, n, e, t]]</attributes>
      </hostName>
    

    As you can see, the first singleName includes .net, only to backtrack because no more '.' follows. Then tld matches.

    The "tedious" ways to fix it would be to use hold[] (Live):

    dottedName = +qi::hold[singleName >> qi::char_('.')] >> tld;
    

    Or using a lookahead assertion (less efficient!, Live)

    dottedName = +(&(singleName >> '.') >> singleName >> qi::char_('.')) >> tld;
    

    Simplify!

    Both work. But since all you do at current is propagate the input 1:1 to the resulting string, you can do everything a lot simpler, using raw[], operator% and literals instead of char_:

    tld        = alpha >> *('-' >> alnum | alnum);
    singleName = +alnum % '-';
    dottedName = +(singleName >> '.') >> tld;
    
    hostName = raw[dottedName | singleName];
    

    In this setup, tld, singleName and dottedName need not build an attribute at all, so backtracking is irrelevant. See it Live With Full Test Cases

    #include <boost/phoenix.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <iomanip>
    #include <iostream>
    namespace qi = boost::spirit::qi;
    
    namespace Ast {
        struct Endpoint {
            std::string    host = "";
            unsigned short port = 0;
    
            auto operator<=>(Endpoint const&) const = default;
        };
    }
    
    namespace Grammar {
    
        template <typename It> struct Hostname : qi::grammar<It, std::string()> {
            Hostname() : Hostname::base_type(hostName) {
                tld        = qi::alpha >> *('-' >> qi::alnum | qi::alnum);
                singleName = +qi::alnum % '-';
                dottedName = +(singleName >> '.') >> tld;
    
                hostName   = qi::raw[dottedName | singleName];
            }
    
          private:
            qi::rule<It, std::string()> hostName;
            qi::rule<It>                dottedName, singleName, tld;
        };
    
        template <typename It> struct Endpoint : qi::grammar<It, std::string()> {
            Endpoint(Ast::Endpoint& out) : Endpoint::base_type(endpoint) {
                using namespace qi::labels;
                namespace px = boost::phoenix;
    
                address = hostname;
    
                endpoint =                              //
                    -address[px::ref(out.host) = _1] >> //
                    -(':' >> qi::ushort_[px::ref(out.port) = _1]);
            }
    
          private:
            Hostname<It>                hostname;
            qi::rule<It, std::string()> endpoint, address;
        };
    }
    
    int main() {
        using It = std::string_view::const_iterator;
    
        struct {
            std::string_view input;
            Ast::Endpoint    expected;
        } cases[] = {
            //// these should parse successfully (and do)
            {"0foo", {"0foo", 0}},
            {"foo", {"foo", 0}},
            {"foo.net", {"foo.net", 0}},
            {"foo:1234", {"foo", 1234}},
            {"foo.net:5678", {"foo.net", 5678}},
            {"foo.example.net", {"foo.example.net", 0}},
            {"foo.example.net:9012", {"foo.example.net", 9012}},
            {"foo-bar", {"foo-bar", 0}},
            {"foo-bar.com", {"foo-bar.com", 0}},
            {"foo-bar:1234", {"foo-bar", 1234}},
            {"foo-bar.net-0:5678", {"foo-bar.net-0", 5678}},
            {"foo-bar.example.com-1:9012", {"foo-bar.example.com-1", 9012}},
    
            // these should fail to parse (and do)
            {"foo.0bar", {}},
            {"foo.0bar:1234", {}},
            {"foo.bar-", {}},
            {"foo.bar-:1234", {}},
            {"foo-", {}},
            {"-foo", {}},
        };
    
        for (auto [input, expected] : cases) {
            Ast::Endpoint         actual;
            auto                  f(begin(input)), l(end(input));
            Grammar::Endpoint<It> grammar(actual);
    
            if (parse(f, l, grammar >> qi::eoi)) {
                std::cout << (actual == expected ? "PASS\t" : "FAIL\t") << quoted(input) //
                          << " -> " << actual.host << " / " << actual.port << "\n";
            } else {
                std::cout << (expected == Ast::Endpoint{} ? "PASS\t" : "FAIL\t") << quoted(input)
                          << " (not a valid endpoint)\n";
            }
        }
    }
    

    Printing

    PASS    "0foo" -> 0foo / 0
    PASS    "foo" -> foo / 0
    PASS    "foo.net" -> foo.net / 0
    PASS    "foo:1234" -> foo / 1234
    PASS    "foo.net:5678" -> foo.net / 5678
    PASS    "foo.example.net" -> foo.example.net / 0
    PASS    "foo.example.net:9012" -> foo.example.net / 9012
    PASS    "foo-bar" -> foo-bar / 0
    PASS    "foo-bar.com" -> foo-bar.com / 0
    PASS    "foo-bar:1234" -> foo-bar / 1234
    PASS    "foo-bar.net-0:5678" -> foo-bar.net-0 / 5678
    PASS    "foo-bar.example.com-1:9012" -> foo-bar.example.com-1 / 9012
    PASS    "foo.0bar" (not a valid endpoint)
    PASS    "foo.0bar:1234" (not a valid endpoint)
    PASS    "foo.bar-" (not a valid endpoint)
    PASS    "foo.bar-:1234" (not a valid endpoint)
    PASS    "foo-" (not a valid endpoint)
    PASS    "-foo" (not a valid endpoint)
    

    Code Review: The Problem Of Semantic Actions

    However looking at the code I notice "gratuitous" semantic actions. They, too, are not atomic. See this staple on SA: Boost Spirit: "Semantic actions are evil"?

    I'd strongly suggest avoiding them unless absolutely necessary. This also makes the parser stateless, meaning you don't need to construct and compile the rules/expressions for every parse.

    Live On Coliru

    Grammar::Hostname<It> hostname_;
    
    for (auto [input, expected] : cases) {
        Ast::Endpoint ep;
    
        if (parse(begin(input), end(input),                       //
                  -hostname_ >> -(':' >> qi::ushort_) >> qi::eoi, //
                  ep.host, ep.port))
        {
            std::cout << (ep == expected ? "PASS\t" : "FAIL\t") << quoted(input) //
                      << " -> " << ep.host << " / " << ep.port << "\n";
        } else {
            std::cout << (expected == Ast::Endpoint{} ? "PASS\t" : "FAIL\t") << quoted(input)
                      << " (not a valid endpoint)\n";
        }
    }
    

    With the same output as before, but much smaller code, much shorter compile times.

    Grammar/AST Adaptation

    To make it easier to pass your application AST type instead of binding separate host/port string attributes, you might want to use std::pair or adapt your custom type:

    Live On Coliru

    #include <boost/spirit/include/qi.hpp>
    #include <iomanip>
    #include <iostream>
    namespace qi = boost::spirit::qi;
    
    namespace Ast {
        struct Endpoint {
            std::string    host = "";
            unsigned short port = 0;
    
            auto operator<=>(Endpoint const&) const = default;
        };
    }
    
    BOOST_FUSION_ADAPT_STRUCT(Ast::Endpoint, host, port)
    
    namespace Grammar {
        template <typename It> struct Hostname : qi::grammar<It, std::string()> {
            Hostname() : Hostname::base_type(hostName) {
                tld        = qi::alpha >> *('-' >> qi::alnum | qi::alnum);
                singleName = +qi::alnum % '-';
                dottedName = +(singleName >> '.') >> tld;
    
                hostName   = qi::raw[dottedName | singleName];
            }
    
          private:
            qi::rule<It, std::string()> hostName;
            qi::rule<It>                dottedName, singleName, tld;
        };
    
        template <typename It> struct Endpoint : qi::grammar<It, Ast::Endpoint()> {
            Endpoint() : Endpoint::base_type(endpoint) {
                endpoint = -hostname >> -(':' >> qi::ushort_);
            }
    
          private:
            Hostname<It>                  hostname;
            qi::rule<It, Ast::Endpoint()> endpoint, address;
        };
    }
    
    int main() {
        using It = std::string_view::const_iterator;
    
        struct {
            std::string_view input;
            Ast::Endpoint    expected;
        } cases[] = {
            //// these should parse successfully (and do)
            {"0foo", {"0foo", 0}},
            {"foo", {"foo", 0}},
            {"foo.net", {"foo.net", 0}},
            {"foo:1234", {"foo", 1234}},
            {"foo.net:5678", {"foo.net", 5678}},
            {"foo.example.net", {"foo.example.net", 0}},
            {"foo.example.net:9012", {"foo.example.net", 9012}},
            {"foo-bar", {"foo-bar", 0}},
            {"foo-bar.com", {"foo-bar.com", 0}},
            {"foo-bar:1234", {"foo-bar", 1234}},
            {"foo-bar.net-0:5678", {"foo-bar.net-0", 5678}},
            {"foo-bar.example.com-1:9012", {"foo-bar.example.com-1", 9012}},
    
            // these should fail to parse (and do)
            {"foo.0bar", {}},
            {"foo.0bar:1234", {}},
            {"foo.bar-", {}},
            {"foo.bar-:1234", {}},
            {"foo-", {}},
            {"-foo", {}},
        };
    
        static Grammar::Endpoint<It> const grammar;
    
        for (auto [input, expected] : cases) {
            Ast::Endpoint ep;
    
            if (parse(begin(input), end(input), grammar >> qi::eoi, ep)) {
                std::cout << (ep == expected ? "PASS\t" : "FAIL\t") << quoted(input) //
                          << " -> " << ep.host << " / " << ep.port << "\n";
            } else {
                std::cout << (expected == Ast::Endpoint{} ? "PASS\t" : "FAIL\t") << quoted(input)
                          << " (not a valid endpoint)\n";
            }
        }
    }
    

    Still with the same output as ever. Note how the grammar is stateless.