
BeautifulSoup returns urls of pages on same website shortened


My code for reference:

import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.find_all('a', href=True):
    urls.append(tag['href'])
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except Exception:
        # silently skips URLs that fail -- including the relative one
        pass

The idea is to fetch the payload of a webpage and then scrape it for hyperlinks. One of the links points to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'.

However the result I'm getting from BeautifulSoup is:

>>> urls
['http://www.yahoo.com/', '../../index.html']

This presents a problem, because the second part of the script cannot be executed on the second, shortened URL. Is there any way to make BeautifulSoup retrieve the full URL?


Solution

  • That's because the link on the webpage is actually of that form. The HTML from the page is:

    <p>Or let's just link to <a href=../../index.html>another page on this server</a></p>

    This is called a relative link.

    To convert this to an absolute link, you can use urljoin from the standard library.

    from urllib.parse import urljoin  # Python 3
    
    urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
            '../../index.html')
    # returns 'http://csb.stanford.edu/class/public/index.html'
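    Applied to the scraper above, this means resolving every scraped `href` against the URL of the page it was found on before requesting it. A minimal sketch (network calls omitted; `make_absolute` is a helper name chosen here for illustration, not part of BeautifulSoup or httplib2):

        from urllib.parse import urljoin
        
        # the page the hrefs were scraped from
        base = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'
        
        def make_absolute(base_url, href):
            # urljoin leaves absolute URLs untouched and resolves relative ones
            return urljoin(base_url, href)
        
        hrefs = ['http://www.yahoo.com/', '../../index.html']
        urls = [make_absolute(base, h) for h in hrefs]
        # urls == ['http://www.yahoo.com/',
        #          'http://csb.stanford.edu/class/public/index.html']

    With the list converted this way, the second loop in the question can call `h.request(url)` on every entry without special-casing relative links.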