c++ boost boost-url

Is there a way to get second level domain using boost-url?


I am trying to get the second-level domain (SLD) of a URL using boost-url. For example, if the URL is https://google.com, I want to store google in a std::string.

Here is a complete example:

#include <boost/url.hpp>
#include <iostream>

int main()
{
    std::string url_str = "https://google.com";

    auto result = boost::urls::parse_uri(url_str);
    boost::urls::url_view url = result.value();

    std::string protocol = url.scheme();
    std::string domain = std::string(url.host());

    std::cout << protocol << std::endl; // outputs `https`
    std::cout << domain << std::endl;   // outputs `google.com`
    // what I actually need here is `google`

    return 0;
}

I wrote a function myself to extract the SLD, but I was wondering whether boost-url also provides a function for this:

std::string get_sec_level_domain(std::string domain)
{
    std::size_t pos = domain.find_last_of('.');
    if (pos != std::string::npos && pos > 0)
    {
        std::string sld = domain.substr(0, pos);
        return sld;
    }
    return "";
}

Solution

  • Boost URL implements general-purpose URIs. Not all URIs contain fully-qualified domain names. As such, parsing the scheme-dependent parts of the authority is mostly out of scope for the library.

    However, since internet addresses are ubiquitous in network URIs (ftp, ssh, sftp, http, etc.), some support is there. You can at least use the host type to your advantage, so that you do not misinterpret IP addresses or an empty host as domain names:


    As an example test bed:

    Live On Coliru [1]

    
    #include <boost/url.hpp>
    #include <iostream>
    
    int main () {
      for (auto txt : {
         // explicit port
         "https://my.pretty.sub.domain.com:8989/path/to/resource?stuff=more&stuff#end",
         "https://my.com:8989/path/to/resource?stuff=more&stuff#end",
         "https://localhost:8989/path/to/resource?stuff=more&stuff#end",
         "https://[::1]:8989/path/to/resource?stuff=more&stuff#end",
         "https://127.0.0.1:8989/path/to/resource?stuff=more&stuff#end",
         // without port
         "https://my.pretty.sub.domain.com/path/to/resource?stuff=more&stuff#end",
         "https://my.com/path/to/resource?stuff=more&stuff#end",
         "https://localhost/path/to/resource?stuff=more&stuff#end",
         "https://[::1]/path/to/resource?stuff=more&stuff#end",
         "https://127.0.0.1/path/to/resource?stuff=more&stuff#end",
      }) {
        if (auto parsed = boost::urls::parse_uri(txt); parsed && parsed->has_authority()) {
          auto url = parsed.value();
          switch (url.host_type ())
            {
            case boost::urls::host_type::ipv4:
            case boost::urls::host_type::ipv6:
            case boost::urls::host_type::ipvfuture:
            case boost::urls::host_type::none:
              std::cerr << "address or none: '" << url.host () << "'\n";
              break;
            case boost::urls::host_type::name:
              std::cout << "maybe FQDN: '" << url.host_name () << "'\n";
              break;
            }
        }
      }
    }
    

    Printing

    maybe FQDN: 'my.pretty.sub.domain.com'
    maybe FQDN: 'my.com'
    maybe FQDN: 'localhost'
    address or none: '[::1]'
    address or none: '127.0.0.1'
    maybe FQDN: 'my.pretty.sub.domain.com'
    maybe FQDN: 'my.com'
    maybe FQDN: 'localhost'
    address or none: '[::1]'
    address or none: '127.0.0.1'
    

    [1] Note: Coliru sadly doesn't let me share this example because the URLs trigger its spam detection. You can still see the output there if you paste the code and build it with: g++ -std=c++20 -O2 -Wall -pedantic -pthread main.cpp -lboost_url && ./a.out
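
Once the host type is known to be a name, picking out the label the question asks for is plain string handling. The following is only a minimal sketch using the standard library: `second_level_label` is a hypothetical helper, not something Boost URL provides, and it assumes the "second-level domain" is simply the label directly before the last dot. It deliberately knows nothing about multi-label public suffixes such as co.uk; handling those correctly would need something like the Public Suffix List.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of Boost.URL): returns the label directly
// before the last dot of a host name, or an empty string if there is none.
// Naive on purpose: it does not know about multi-label public suffixes.
std::string second_level_label(std::string_view host)
{
    // Ignore the trailing dot of an absolute name such as "google.com."
    if (!host.empty() && host.back() == '.')
        host.remove_suffix(1);

    const std::size_t last = host.rfind('.');
    if (last == std::string_view::npos || last == 0)
        return {}; // single label (e.g. "localhost") or nothing before the dot

    // Look for the dot that precedes the second-level label, if any.
    const std::size_t prev = host.rfind('.', last - 1);
    const std::size_t begin = (prev == std::string_view::npos) ? 0 : prev + 1;
    return std::string(host.substr(begin, last - begin));
}
```

Fed with the name hosts from the test bed above, this returns `google` for `google.com`, `domain` for `my.pretty.sub.domain.com`, and an empty string for `localhost`. Unlike the function in the question, it does not return `www.google` for `www.google.com`, but it would still return `co` for a host like `www.bbc.co.uk`, which is the limitation mentioned above.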