python html url-parsing

Reconstructing absolute URLs from relative URLs on a page


Given the absolute URL of a page and a relative link found within that page, is there a way to a) definitively reconstruct or b) make a best-effort reconstruction of the absolute URL that the relative link points to?

In my case, I'm reading an HTML file from a given URL using Beautiful Soup, extracting the src of every img tag, and trying to construct a list of absolute URLs for the page's images.

My Python function so far looks like:

def get_image_url(page_url, image_src):

    from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    src = image_src
    if src.startswith('http'):
        # It's an absolute URL, do nothing.
        pass
    elif src.startswith('/'):
        # If it's a root-relative URL, append it to the scheme and host of the page:
        src = parsed.scheme + '://' + url_base + src
    else:
        # If it's a relative URL, ?
        pass

    return src

NOTE: I don't need a Python answer, just the logic required.


Solution

  • Very simple: urljoin resolves a relative reference against a base URL:

    >>> from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
    >>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
    'http://mysite.com/images/img.png'
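
To tie this back to the question, here is a minimal sketch of how the whole task might look with urljoin doing the resolution. It assumes Python 3 (urllib.parse) and, to stay self-contained, uses the standard library's HTMLParser instead of Beautiful Soup. ImgSrcCollector and absolute_image_urls are hypothetical names introduced for illustration, and the sketch does not honor a <base href> tag if the page has one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def get_image_url(page_url, image_src):
    """Resolve image_src against the URL of the page it was found on."""
    # urljoin covers every branch of the original function: absolute URLs
    # are returned unchanged, root-relative and relative ones are resolved.
    return urljoin(page_url, image_src)


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def absolute_image_urls(page_url, html):
    """Return absolute URLs for all <img> sources found in html."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [get_image_url(page_url, src) for src in collector.sources]
```

With the page URL 'http://mysite.com/foo/bar/x.html', get_image_url leaves 'http://other.com/a.png' unchanged, turns '/images/img.png' into 'http://mysite.com/images/img.png', and turns 'img.png' into 'http://mysite.com/foo/bar/img.png', so the if/elif/else chain in the question is not needed.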