Convert partial URL from parsed page to complete URL


I'm crawling a page that contains incomplete URLs, and I need to convert them to complete HTTP URLs. For example, the original address is http://www.example.com/dir1/dir1/ and the index file contains the following links:

/page.htm
page.htm
../page.htm
../../page.htm

I need to convert them to:

http://www.example.com/page.htm
http://www.example.com/dir1/dir1/page.htm
http://www.example.com/dir1/page.htm
http://www.example.com/page.htm

I'm not sure how to recognize the ../ segments and resolve them against the original address, and urlparse(temp_href).geturl() doesn't work.

How can I convert them correctly?


Solution

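  • urljoin should do the trick for you. It resolves a relative link against a base URL, including any ../ segments, whereas urlparse(temp_href).geturl() only parses and reassembles the string it is given, with no base to resolve against.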

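    try:
        from urllib.parse import urljoin  # Python 3
    except ImportError:
        from urlparse import urljoin  # Python 2

    base = "http://www.example.com/dir1/dir1/"
    print(urljoin(base, "/page.htm"))       # http://www.example.com/page.htm
    print(urljoin(base, "page.htm"))        # http://www.example.com/dir1/dir1/page.htm
    print(urljoin(base, "../page.htm"))     # http://www.example.com/dir1/page.htm
    print(urljoin(base, "../../page.htm"))  # http://www.example.com/page.htm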
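
Since the question is about links pulled from a crawled page, here is a minimal sketch of applying urljoin to every href in a document. It assumes Python 3 (where urljoin lives in urllib.parse) and uses the standard library's html.parser; the names LinkCollector and absolute_links and the sample markup are made up for illustration, not taken from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute URLs for each <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve the (possibly relative) link against the page URL.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(base_url, html):
    """Return every link in `html` as an absolute URL, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Illustrative markup containing the four links from the question.
sample = (
    '<a href="/page.htm">a</a> <a href="page.htm">b</a> '
    '<a href="../page.htm">c</a> <a href="../../page.htm">d</a>'
)

for url in absolute_links("http://www.example.com/dir1/dir1/", sample):
    print(url)
```

The base passed in should be the URL the page was actually fetched from. The trailing slash matters: without it, urljoin treats the last path segment as a file name and resolves relative links against its parent directory.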