
BeautifulSoup returns urls of pages on same website shortened


My code for reference:

import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.find_all('a', href=True):
    urls.append(tag['href'])
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except Exception:
        # silently skips URLs that fail -- including the relative one
        pass

The idea is to fetch the payload of a webpage and then scrape it for hyperlinks. One of the links points to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'.

However the result I'm getting from BeautifulSoup is:

>>> urls
['http://www.yahoo.com/', '../../index.html']

This presents a problem, because the second part of the script cannot be executed on the second, shortened URL. Is there any way to make BeautifulSoup retrieve the full URL?


Solution

  • That's because the link on the webpage is actually of that form. The HTML from the page is:

    <p>Or let's just link to <a href=../../index.html>another page on this server</a></p>

    This is called a relative link.

    To convert this to an absolute link, you can use urljoin from the standard library.

    from urllib.parse import urljoin  # Python 3
    
    urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
            '../../index.html')
    # returns 'http://csb.stanford.edu/class/public/index.html'
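    Applied to the scraper above, this means resolving every scraped `href` against the URL of the page it was found on before requesting it. A minimal sketch (network calls omitted; `make_absolute` is a helper name chosen here for illustration, not part of BeautifulSoup or httplib2):

        from urllib.parse import urljoin
        
        # the page the hrefs were scraped from
        base = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'
        
        def make_absolute(base_url, href):
            # urljoin leaves absolute URLs untouched and resolves relative ones
            return urljoin(base_url, href)
        
        hrefs = ['http://www.yahoo.com/', '../../index.html']
        urls = [make_absolute(base, h) for h in hrefs]
        # urls == ['http://www.yahoo.com/',
        #          'http://csb.stanford.edu/class/public/index.html']

    With the list converted this way, the second loop in the question can call `h.request(url)` on every entry without special-casing relative links.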