Tags: python, url, python-2.7, domain-name

How to get the domain name (name + TLD) from a URL in Python


I want to extract the domain name (site name + TLD) from a list of URLs that vary in format. For instance (current state -> what I want):

mail.yahoo.com -> yahoo.com
account.hotmail.co.uk -> hotmail.co.uk
x.it -> x.it
google.mail.com -> google.com

Is there any Python code that can extract this from a URL, or do I have to do it manually?


Solution

  • This is somewhat non-trivial, as there is no simple rule for where the registered domain (site name + public suffix) begins: a public suffix can be a single label such as com or several labels such as co.uk. Instead, the valid public suffixes are maintained as a list at PublicSuffix.org.

    A Python package called publicsuffix queries a locally stored copy of that list:

    >>> from publicsuffix import PublicSuffixList
    >>> psl = PublicSuffixList()
    >>> print(psl.get_public_suffix('mail.yahoo.com'))
    yahoo.com
    >>> print(psl.get_public_suffix('account.hotmail.co.uk'))
    hotmail.co.uk
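
    The question mentions URLs, while the calls above are given bare hostnames, so the host may need to be pulled out of each URL first. Below is a minimal standard-library sketch of that step, followed by a simplified version of the suffix lookup to show the idea. The names hostname_from_url, registered_domain and SAMPLE_SUFFIXES are hypothetical, and SAMPLE_SUFFIXES is a tiny hand-picked stand-in for the real list. The sketch also ignores the wildcard and exception rules of the real list, so treat it as an illustration rather than a replacement for the package.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7

# Assumption: a tiny illustrative subset, NOT the real Public Suffix List.
SAMPLE_SUFFIXES = {"com", "it", "uk", "co.uk"}


def hostname_from_url(url):
    """Return the lowercased host of a URL or of a bare hostname."""
    if "://" not in url:
        url = "//" + url  # makes urlsplit treat a bare host as the netloc
    return (urlsplit(url).hostname or "").lower()


def registered_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label to its left.

    Returns None if no suffix matches or the host is itself a suffix.
    """
    labels = host.split(".")
    # Trying i = 0, 1, 2, ... checks the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


if __name__ == "__main__":
    for url in ("http://mail.yahoo.com/inbox",
                "https://account.hotmail.co.uk:443/",
                "x.it",
                "google.mail.com"):
        print(url, "->", registered_domain(hostname_from_url(url)))
```

    With this sample set the first three inputs give yahoo.com, hotmail.co.uk and x.it. The last one gives mail.com rather than google.com, because the name registered directly under the com suffix in google.mail.com is mail.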