I'm crawling a page that contains incomplete (relative) URLs, and I need to convert them to complete HTTP URLs. For example, the original address is: http://www.example.com/dir1/dir1/
and the index file contains the following links:
/page.htm
page.htm
../page.htm
../../page.htm
I need to convert them to
http://www.example.com/page.htm
http://www.example.com/dir1/dir1/page.htm
http://www.example.com/dir1/page.htm
http://www.example.com/page.htm
I'm not sure how to recognize the ../ segments and resolve them against the original address, and urlparse(temp_href).geturl() doesn't work.
How do I convert them correctly?
urljoin should do the trick for you. It resolves a relative link against a base URL, including any ../ segments.
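# Python 3; on Python 2 use: from urlparse import urljoin
from urllib.parse import urljoin

base = "http://www.example.com/dir1/dir1/"
print(urljoin(base, "/page.htm"))       # http://www.example.com/page.htm
print(urljoin(base, "page.htm"))        # http://www.example.com/dir1/dir1/page.htm
print(urljoin(base, "../page.htm"))     # http://www.example.com/dir1/page.htm
print(urljoin(base, "../../page.htm"))  # http://www.example.com/page.htm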
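If you are resolving every link found on a crawled page, something like the following might be a reasonable starting point. It is only a sketch, assuming Python 3; the helper name absolutize and the hard-coded list of hrefs are made up for illustration and stand in for whatever your crawler extracts. It also shows one thing worth watching for: the trailing slash on the base URL matters.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper, not a library function)."""
    return [urljoin(base_url, href) for href in hrefs]


# Illustrative input: the links from the question, as a crawler might extract them.
links = ["/page.htm", "page.htm", "../page.htm", "../../page.htm"]
resolved = absolutize("http://www.example.com/dir1/dir1/", links)
for url in resolved:
    print(url)

# Without the trailing slash, the last path segment is treated as a file name
# and is replaced rather than extended.
no_slash = urljoin("http://www.example.com/dir1/dir1", "page.htm")
print(no_slash)  # http://www.example.com/dir1/page.htm
```

So if the address you crawled is a directory, make sure the base you pass to urljoin ends with a slash, or the results will be one level higher than you expect.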