Return root domain from url in R


Given website addresses, e.g.

http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2

How do I return the root domain in R, e.g.

example.com
example2.co.uk

For my purposes I would define the root domain to have structure

example_name.public_suffix

where example_name excludes "www" and public_suffix is on the list here:

https://publicsuffix.org/list/effective_tld_names.dat

Is this still the best regex-based solution?

https://stackoverflow.com/a/8498629/2109289

What about something in R that parses root domain based off the public suffix list, something like:

http://simonecarletti.com/code/publicsuffix/

Edited: Adding extra info based on Richard's comment

XML::parseURI returns the host name in its $server field, i.e. the part between the first "//" and the next "/", without the port. For example:

> library(XML)
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
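
For comparison, the same host extraction can be approximated in base R with a few regular expressions. This is only a minimal sketch: `extract_host` is a hypothetical helper name, and it assumes well-formed URLs without IPv6 literals.

```r
# Hypothetical helper: pull the host name out of one or more URLs (base R only).
extract_host <- function(urls) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", urls)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                     # drop path, query, fragment
  host <- sub("^[^@]*@", "", host)                      # drop any userinfo
  host <- sub(":[0-9]+$", "", host)                     # drop the port
  tolower(host)
}

extract_host(c("http://www.example.com/page1/#",
               "https://subdomain.example2.co.uk/asdf?retrieve=2"))
# [1] "www.example.com"          "subdomain.example2.co.uk"
```

Because `sub` is vectorized, the helper accepts a whole character vector of URLs at once.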

Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:

Algorithm
  • Match domain against all rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one which is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
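
The steps above can be sketched in base R roughly as follows. This is an illustration under stated assumptions rather than a complete implementation: `psl_rules` is a tiny hand-picked subset of rules (the real list has thousands of entries and would need to be downloaded and parsed), `registrable_domain` is a hypothetical function name, and internationalized domain names are not handled.

```r
# Tiny illustrative rule set (an assumption; not the full public suffix list).
psl_rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Hypothetical helper: registrable domain of a single host name,
# or NA if the host is itself a public suffix.
registrable_domain <- function(host, rules = psl_rules) {
  # Compare labels right to left, so reverse both host and rules.
  labels <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])
  is_exception <- startsWith(rules, "!")
  rule_labels <- lapply(strsplit(sub("^!", "", rules), ".", fixed = TRUE), rev)

  # A rule matches if every one of its labels equals the host label or is "*".
  matches <- vapply(rule_labels, function(r) {
    length(r) <= length(labels) &&
      all(r == "*" | r == labels[seq_along(r)])
  }, logical(1))

  if (!any(matches)) {
    suffix_len <- 1                                # prevailing rule is "*"
  } else if (any(matches & is_exception)) {
    idx <- which(matches & is_exception)[1]
    suffix_len <- length(rule_labels[[idx]]) - 1   # drop the leftmost label
  } else {
    suffix_len <- max(lengths(rule_labels[matches]))  # rule with most labels
  }

  if (length(labels) <= suffix_len) return(NA_character_)
  # Public suffix plus one additional label.
  paste(rev(labels[seq_len(suffix_len + 1)]), collapse = ".")
}

registrable_domain("www.example.com")
# [1] "example.com"
registrable_domain("subdomain.example2.co.uk")
# [1] "example2.co.uk"
```

With the wildcard and exception rules in this subset, `registrable_domain("foo.bar.ck")` gives "foo.bar.ck" while `registrable_domain("www.ck")` gives "www.ck", which is the behaviour the exception step is meant to produce.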

Solution

  • There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:

    library(httr)
    host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
    host
    # [1] "subdomain.example2.co.uk"
    

    The second is extracting the organizational domain (also called the root domain or top private domain). This can be done using the tldextract package, which is inspired by the Python package of the same name and uses Mozilla's public suffix list:

    library(tldextract)
    domain.info <- tldextract(host)
    domain.info
    #                       host subdomain   domain   tld
    # 1 subdomain.example2.co.uk subdomain example2 co.uk
    

    tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:

    paste(domain.info$domain, domain.info$tld, sep=".")
    # [1] "example2.co.uk"