python string urlparse

How can I detect subdomains by analyzing a URL?


I have a couple of websites that are subdomains of hosting platforms (e.g., WordPress, Altervista, Blogpress, ...).

I'm currently using urlparse to split URLs into their components. However, it does not seem to let me distinguish subdomains, only the TLD.

Alternatively, I could use a vocabulary of all the subdomain suffixes and assign 1 or 0 based on that. But since I don't know all the blog platforms, I'm wondering whether there is a way to detect them automatically.

For example, I was thinking of counting the dots, but many URLs contain dots without being subdomains, so this approach is not reliable.
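
A minimal sketch of the problem, using only the standard library: urlparse returns the whole hostname as a single string, and counting dots gives the same answer for a hosted blog and for an ordinary site under a two-part suffix. The two URLs are illustrative examples, not taken from a real dataset.

```python
from urllib.parse import urlparse

urls = [
    "https://artgateblog.altervista.org/",  # blog hosted under altervista.org
    "https://bbc.co.uk/",                   # ordinary domain under the co.uk suffix
]

# urlparse exposes the hostname only as one undivided string
hosts = [urlparse(u).hostname for u in urls]
print(hosts)

# Both hostnames contain the same number of dots,
# so dot counting alone cannot tell the two cases apart
dot_counts = [h.count(".") for h in hosts]
print(dot_counts)
```

Separating the two cases needs knowledge of which suffixes (co.uk, altervista.org, ...) are shared by many independent sites.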


Solution

  • I think this library should do the trick: https://pypi.org/project/tld/

    Here's an example:

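    from tld import get_tld
    url = "https://artgateblog.altervista.org/"
    # as_object=True returns a result object instead of a plain string
    res = get_tld(url, as_object=True)
    # res.domain is the label just before the matched suffix;
    # printing res itself shows the matched suffix
    blogname, blog_domain = res.domain, res
    print(blogname, blog_domain)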
    

    Out:

    artgateblog altervista.org
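
The idea behind this kind of lookup can be sketched without the library: compare the hostname against a list of known shared suffixes and keep the longest match. This is not how tld is implemented; SAMPLE_SUFFIXES and split_host are hypothetical names, and the suffix set is a tiny hand-picked sample standing in for a full list such as the Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample for illustration only;
# a real suffix list has thousands of entries.
SAMPLE_SUFFIXES = {"org", "com", "co.uk", "altervista.org", "blogspot.com"}

def split_host(url):
    """Return (labels_before_suffix, suffix) using the longest matching sample suffix."""
    # Add a scheme when it is missing so that urlparse fills in the hostname
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname
    labels = host.split(".")
    # Try candidate suffixes from the longest to the shortest
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SAMPLE_SUFFIXES:
            return ".".join(labels[:i]), candidate
    # No known suffix matched
    return host, ""

print(split_host("https://artgateblog.altervista.org/"))
print(split_host("12story.altervista.org"))
```

A 1/0 flag for "hosted on a shared platform" could then be derived by checking whether the matched suffix has more than one label, under the assumption that the suffix list is complete enough for the data at hand.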
    

    EDIT after comments:

    For URLs that don't include a protocol, I think you need to pass fix_protocol=True, as below:

    from tld import get_tld
    urls = ["12story.altervista.org", "fantasy_story.blogspot.com"]
    for url in urls:
        res = get_tld(url, as_object=True, fix_protocol=True)
        blogname, blog_domain = res.domain, res