I'm working on a page-tracking web app and I'd like to get the canonical domain for a list of sites. As far as I know, there is no good way of telling where a site's ownership of subdomains and top-level domains starts and ends. I'm not sure of the best way to describe that, so here is an example:
If I own a personal URL, mysite.com, I am able to set up subdomains such as www.mysite.com, cdn.mysite.com, and so forth.
If my "group" has a website at a university, such as computerscience.myuni.edu, I might also have control over www.computerscience.myuni.edu, but not myuni.edu.
If I am a huge business and need to spread web traffic out, I might even have www.acme.com, ww2.acme.com, ww3.acme.com, etc.
So nothing is certain, but if I'm given a URL I can probably strip off www., ww2., and cdn., and maybe secure. from the front. Are there any other common "subdomains" that I'm not thinking of that are generally not used to serve up a different website?
I guess I'm just trying to figure out the best way to get the real "canonical" domain name for a site.
First of all, you should make the distinction between domain names and websites/URLs. I don't think there is any efficient way to easily identify a website's owner, but the canonical domain name can be deduced from the structure of the name itself.
Roughly, a Fully Qualified Domain Name (FQDN) is composed of the subdomain(s), the name, and the suffix. In your case, you are looking for the canonical domain name (name + suffix).
Since the Domain Name System is hierarchical, an FQDN like www.example.com. (the trailing dot stands for the root) should be read from the end to the beginning, as .com.example.www, and can be decomposed this way:
com
example
www
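As a minimal sketch of that decomposition in Python (the function name `labels_from_root` is only an illustrative choice, not part of any library), the labels can be split on dots and reversed so they run from the suffix down to the subdomain:

```python
from typing import List


def labels_from_root(fqdn: str) -> List[str]:
    """Split a domain name into labels, ordered from the root downward."""
    # Drop the optional trailing dot that marks the DNS root,
    # normalise case, then reverse the left-to-right label order.
    return fqdn.rstrip(".").lower().split(".")[::-1]


print(labels_from_root("www.example.com."))  # ['com', 'example', 'www']
```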
To identify the canonical domain, you should proceed in the same order: first find the suffix, then take the name immediately to its left.
There is no official database listing all public suffixes; however, on the initiative of the Mozilla Foundation, an unofficial one has been created. The project is named the Public Suffix List. Its aim is to record the suffixes under which people can register domain names, and several implementations exist to parse the database.
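To illustrate the idea, here is a rough Python sketch of a suffix lookup. It assumes a tiny hand-picked set of suffixes (`SAMPLE_SUFFIXES`, hypothetical and far from complete) instead of the real database, and it ignores the wildcard and exception rules that the actual list contains, so treat it as a demonstration of the principle rather than a replacement for one of the existing parsers. The function name `registrable_domain` is also just an illustrative choice.

```python
# Hypothetical, hand-picked sample for illustration only. Real code should
# load the full Public Suffix List, which also has wildcard and exception
# rules that this sketch does not handle.
SAMPLE_SUFFIXES = {"com", "edu", "uk", "co.uk"}


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label, or None."""
    labels = hostname.rstrip(".").lower().split(".")
    # Try the longest candidate suffix first so that "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            # Keep exactly one label to the left of the suffix.
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


print(registrable_domain("www.mysite.com"))                 # mysite.com
print(registrable_domain("www.computerscience.myuni.edu"))  # myuni.edu
print(registrable_domain("ww2.acme.co.uk"))                 # acme.co.uk
```

Note that for the university example this yields myuni.edu, the name that was actually registered. The list cannot tell you that a department runs its own site under computerscience.myuni.edu; it only tells you where registration by the public stops.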
If you are interested, I wrote an article on my personal blog introducing the domain name system, where I describe the domain name structure in more detail: What's a domain name and what's behind the scene