My code for reference:
import httplib2
from bs4 import BeautifulSoup
h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
urls.append(tag['href'])
responses = []
contents = []
for url in urls:
try:
response1, content1 = h.request(url)
responses.append(response1)
contents.append(content1)
except:
pass
The idea is, I get the payload of a webpage, and then scrape that for hyperlinks. One of the links is to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'
However the result I'm getting from BeautifulSoup is:
>>> urls
['http://www.yahoo.com/', '../../index.html']
This presents a problem, because the second part of the script cannot be executed on the second, shortened url. Is there any way to make BeautifulSoup retrieve the full url?
That's because the link on the webpage is actually of that form. The HTML from the page is:
<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>
This is called a relative link.
To convert this to an absolute link, you can use urljoin
from the standard library.
from urllib.parse import urljoin # Python3
urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html`,
'../../index.html')
# returns http://csb.stanford.edu/class/public/index.html