URL parsing using boost::spirit


I’m experimenting with boost::spirit to write a URL parser. My objective is to parse the input URL (valid or invalid) and break it down into prefix, host and suffix as below:

Input ipv6 URL: https://[::ffff:192.168.1.1]:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: :8080/path/to/resource

Input ipv6 URL: https://::ffff:192.168.1.1/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: /path/to/resource

Input ipv4 URL: https://192.168.1.1:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: 192.168.1.1
Suffix: :8080/path/to/resource

The colon character ‘:’ is used as a delimiter within ipv6 addresses and also as the delimiter before the port in ipv4 URLs. Because of this ambiguity, I’m having a hard time defining a boost::spirit grammar that works for both ipv4 and ipv6 URLs. Please refer to the code below:

struct UrlParts
{
    std::string scheme;
    std::string host;
    std::string port;
    std::string path;
};

BOOST_FUSION_ADAPT_STRUCT(
    UrlParts,
    (std::string, scheme)
    (std::string, host)
    (std::string, port)
    (std::string, path)
)

void parseUrl_BoostSpirit(const std::string &input, std::string &prefix, std::string &suffix, std::string &host)
{
    namespace qi = boost::spirit::qi;

    // Define the grammar
    qi::rule<std::string::const_iterator, UrlParts()> url = -(+qi::char_("a-zA-Z0-9+-.") >> "://") >> -qi::lit('[') >> +qi::char_("a-fA-F0-9:.") >> -qi::lit(']') >> -(qi::lit(':') >> +qi::digit) >> *qi::char_;


    // Parse the input
    UrlParts parts;
    auto iter = input.begin();
    if (qi::parse(iter, input.end(), url, parts))
    {
        prefix = parts.scheme.empty() ? "" : parts.scheme + "://";
        host = parts.host;
        suffix = (parts.port.empty() ? "" : ":" + parts.port) + parts.path;
    }
    else
    {
        host = input;
    }
}

The above code produces incorrect output for the ipv4 URL:

Input URL ipv4: https://192.168.1.1:8080/path/to/resource
Broken parts:
Prefix: https://
Host: 192.168.1.1:8080
Suffix: /path/to/resource
That is, Host contains :8080 when it should be part of Suffix.

If I change the URL grammar, I can fix the ipv4 but then ipv6 breaks.

Of course this can be done using trivial if-else parsing logic, but I'm trying to do it more elegantly using boost::spirit. Any suggestions on how to update the grammar to support both ipv4 and ipv6 URLs?

PS: I'm aware that URLs with an ipv6 address without [ ] are invalid as per the RFC, but the application I'm working on requires processing these invalid URLs as well.

Thanks in advance!


Solution

  • First off, your expression char_("a-zA-Z0-9+-.") accidentally allows ',' inside the scheme, because +-. is read as the range from '+' to '.': https://coliru.stacked-crooked.com/a/14c00775d9f3d99e

    To inoculate against that, always put - first or last in a character set so it can't be misinterpreted as a range: char_("+.-"). Yeah, that's subtle.
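    To see why the comma sneaks in, here is a minimal standard-library sketch (no Spirit involved) that expands a range the way a character-set parser reads lo-hi. The helper name expand_range is made up for illustration, and the result assumes an ASCII-compatible character set:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: list every character from lo through hi, inclusive,
// which is what a "lo-hi" entry in a character set stands for.
std::string expand_range(char lo, char hi) {
    assert(lo <= hi && "range must be ascending");
    std::string out;
    for (int c = lo; c <= hi; ++c)
        out += static_cast<char>(c);
    return out;
}
```

    With ASCII, expand_range('+', '.') gives "+,-.": the codes 43 through 46 are '+', ',', '-' and '.', so the range covers one character more than the three that were intended.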

    -'[' >> p >> -']' allows for unmatched brackets. Instead say ('[' >> p >> ']' | p).

    With those applied, let's rewrite the parser expression so we see what's happening:

    // Define the grammar
    auto scheme_ = qi::copy(+qi::char_("a-zA-Z0-9+.-") >> "://");
    auto host_   = qi::copy(+qi::char_("a-fA-F0-9:."));
    auto port_   = qi::copy(':' >> +qi::digit);
    
    qi::rule<std::string::const_iterator, UrlParts()> const url =
        -scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;
    

    Here's a test bed that runs the examples from your question.

    Note that I simplified the handling: raw[] makes the scheme attribute include the ://, and the function just returns UrlParts, which then gets printed, because it is more insightful to see exactly what the parser produced.

    // #define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <boost/pfr/io.hpp>
    #include <iomanip>     // std::quoted
    #include <iostream>
    #include <string_view>
    
    struct UrlParts { std::string scheme, host, port, path; };
    BOOST_FUSION_ADAPT_STRUCT(UrlParts, scheme, host, port, path)
    
    UrlParts parseUrl_BoostSpirit(std::string_view input) {
        namespace qi = boost::spirit::qi;
    
        using It = std::string_view::const_iterator;
        qi::rule<It, UrlParts()> url;
        //using R = qi::rule<It, std::string()>;
        //R scheme_, host_, port_;
        auto scheme_ = qi::copy(qi::raw[+qi::char_("a-zA-Z0-9+.-") >> "://"]);
        auto host_   = qi::copy(+qi::char_("a-fA-F0-9:."));
        auto port_   = qi::copy(':' >> +qi::digit);
        url          = -scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;
    
        // BOOST_SPIRIT_DEBUG_NODES((scheme_)(host_)(port_)(url));
        BOOST_SPIRIT_DEBUG_NODES((url));
    
        // Parse the input
        UrlParts parts;
        parse(input.begin(), input.end(), qi::eps > url > qi::eoi, parts);
        return parts;
    }
    
    int main() {
        using It        = std::string_view::const_iterator;
        using Exception = boost::spirit::qi::expectation_failure<It>;
    
        for (std::string_view input : {
                 "https://[::ffff:192.168.1.1]:8080/path/to/resource",
                 "https://::ffff:192.168.1.1/path/to/resource",
                 "https://192.168.1.1:8080/path/to/resource",
             }) {
            try {
                auto parsed = parseUrl_BoostSpirit(input);
                // using boost::fusion::operator<<; // less clear output, without PFR
                // std::cout << std::quoted(input) << " -> " << parsed << std::endl;
                std::cout << std::quoted(input) << " -> " << boost::pfr::io(parsed) << std::endl;
            } catch (Exception const& e) {
                std::cout << std::quoted(input) << " EXPECTED " << e.what_ << " at "
                          << std::quoted(std::string_view(e.first, e.last)) << std::endl;
            }
        }
    }
    

    Prints:

    "https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
    "https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
    "https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1:8080", "", "/path/to/resource"}
    

    The Problem

    You already assessed the problem: :8080 matches the production for host_. I'd reason that the port specification is the odd one out because it must be the last before '/' or the end of input. In other words:

    auto port_   = qi::copy(':' >> +qi::digit >> &('/' | qi::eoi));
    

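    Now you can use that in your host_ production to avoid eating port specifications. The difference operator acts as a negative look-ahead assertion here: a - b behaves like !b >> a, so a host character is only consumed when no port specification starts at that position: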

    auto host_   = qi::copy(+(qi::char_("a-fA-F0-9:.") - port_));
    

    Now the output becomes

    "https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
    "https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
    "https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
    

    Note that there are some inefficiencies and probably RFC violations in this implementation. The rule is constructed anew on every call, so consider a static instance of the grammar. Also consider using X3.
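    As an aside, it can help to see what the look-ahead does with Spirit out of the picture. The following is a rough hand-rolled sketch of the same idea using only the standard library; it is not what Spirit does internally, and the names Split, port_at and split_authority are invented for this illustration. It assumes the scheme and any brackets have already been stripped:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

struct Split { std::string host, port, rest; };  // hypothetical result type

// Index just past the digits that start at position i.
std::size_t skip_digits(std::string_view s, std::size_t i) {
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
    return i;
}

// True if s[i..] is ":<digits>" followed by '/' or end of input,
// the same condition as the port_ rule with its look-ahead.
bool port_at(std::string_view s, std::size_t i) {
    if (i >= s.size() || s[i] != ':') return false;
    std::size_t j = skip_digits(s, i + 1);
    return j > i + 1 && (j == s.size() || s[j] == '/');
}

// Split "host[:port]rest". Unlike the grammar, an empty host is accepted.
Split split_authority(std::string_view s) {
    auto is_host_char = [](unsigned char c) {
        return std::isxdigit(c) || c == ':' || c == '.';
    };
    std::size_t i = 0;
    // Mirrors +(char_ - port_): take host characters one at a time,
    // but stop as soon as a port specification starts here.
    while (i < s.size() && is_host_char(s[i]) && !port_at(s, i)) ++i;
    assert(i <= s.size());

    Split r;
    r.host = std::string(s.substr(0, i));
    if (port_at(s, i)) {
        std::size_t j = skip_digits(s, i + 1);
        r.port = std::string(s.substr(i + 1, j - i - 1));
        i = j;
    }
    r.rest = std::string(s.substr(i));
    return r;
}
```

    Calling split_authority("192.168.1.1:8080/path/to/resource") yields host 192.168.1.1, port 8080 and the path as the rest, while "::ffff:192.168.1.1/path/to/resource" keeps the whole address as the host. The sketch also makes the remaining ambiguity visible: an unbracketed ipv6 address whose last group is all decimal digits and is followed by '/' or the end of input gets that group taken as a port, and the grammar above behaves the same way.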

    Using X3 and Asio

    I have a related answer here: "What is the nicest way to parse this in C++?" It shows an X3 approach with validation using Asio's networking primitives.

    Boost URL

    Why roll your own?

    #include <boost/url.hpp>
    
    UrlParts parseUrl(std::string_view input) {
        auto parsed = boost::urls::parse_uri(input).value();
        return {parsed.scheme(), parsed.host(), parsed.port(), std::string(parsed.encoded_resource())};
    }
    

    To be really pedantic and get the :// as well:

    UrlParts parseUrl(std::string_view input) {
        auto parsed = boost::urls::parse_uri(input).value();
        assert(parsed.has_authority());
        return {
            parsed.buffer().substr(0, parsed.authority().data() - input.data()),
            parsed.host(),
            parsed.port(),
            std::string(parsed.encoded_resource()),
        };
    }
    

    This parses what you have and much more (fragment from the Reference Help Card):

    [Image: fragment of the Boost.URL Reference Help Card]

    The notable benefits are:

    • conformance (yes, this means that IPv6 requires [])
    • proper encoding and decoding
    • low allocation (many operations work exclusively on the source string view)
    • maintenance (you don't need to debug/audit it yourself)
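    To give an idea of what "proper decoding" involves, here is a minimal percent-decoding sketch using only the standard library. It is purely illustrative and not how Boost URL implements it; the names hex_value and percent_decode are made up for this example:

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Value of one hex digit, or -1 if c is not a hex digit.
int hex_value(unsigned char c) {
    if (std::isdigit(c)) return c - '0';
    if (std::isxdigit(c)) return std::tolower(c) - 'a' + 10;
    return -1;
}

// Decode %XX escapes; returns std::nullopt on a truncated or malformed escape.
std::optional<std::string> percent_decode(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') { out += in[i]; continue; }
        if (i + 2 >= in.size()) return std::nullopt;  // need two more characters
        int hi = hex_value(in[i + 1]);
        int lo = hex_value(in[i + 2]);
        if (hi < 0 || lo < 0) return std::nullopt;    // not hex digits
        assert(hi < 16 && lo < 16);
        out += static_cast<char>(hi * 16 + lo);
        i += 2;                                       // skip the two hex digits
    }
    return out;
}
```

    For instance, percent_decode("more%3Dcomplicated") produces "more=complicated", and a dangling "%3" is rejected. A real library additionally has to know which characters are reserved in which URL component, which is exactly the kind of detail you don't want to maintain yourself.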

    Comparing both parsers on the sample inputs, plus one URL with a query string and fragment:

    ==== "https://[::ffff:192.168.1.1]:8080/path/to/resource" ====
     Spirit -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
     URL    -> {"https://", "[::ffff:192.168.1.1]", "8080", "/path/to/resource"}
    ==== "https://::ffff:192.168.1.1/path/to/resource" ====
     Spirit -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
     URL    -> leftover [boost.url.grammar:4]
    ==== "https://192.168.1.1:8080/path/to/resource" ====
     Spirit -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
     URL    -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
    ==== "https://192.168.1.1:8080/s?quey=param&other=more%3Dcomplicated#bookmark" ====
     Spirit -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}
     URL    -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}