Search code examples
urlpython-3.xnormalization

Canonicalize / normalize a URL?


I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com and http://google.com:80/a/../ shall return the same result.

I would prefer Python 3 and already looked through the urllib module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize() function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?


Solution

  • Following the good start, I composed a method that fits most of the cases commonly found in the web.

    def urlnorm(base, link=''):
      '''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.'''
      new = urlparse(urljoin(base, url).lower())
      return urlunsplit((
        new.scheme,
        (new.port == None) and (new.hostname + ":80") or new.netloc,
        new.path,
        new.query,
        ''))