Following Links in HTML using BeautifulSoup


I am doing a course which requires me to parse this using BeautifulSoup: http://python-data.dr-chuck.net/known_by_Fikret.html

The instructions are: Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.

This is the code I have so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
urllist = list()
taglist = list()

tags = soup('a')

for i in range(count):
    for tag in tags:
        taglist.append(tag)
    url = taglist[pos].get('href', None)
    print('Retrieving: ', url)
    urllist.append(url)
print('Last URL: ', urllist[-1])

This is my output:

Retrieving:  http://python-data.dr-chuck.net/known_by_Fikret.html 
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Last URL:  http://python-data.dr-chuck.net/known_by_Montgomery.html

This is the output that I am supposed to get:

Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://python-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://python-data.dr-chuck.net/known_by_Anayah.html
Last URL:  http://python-data.dr-chuck.net/known_by_Anayah.html

I've been working on this for a while but I still have not been able to get the code to loop correctly. I am new to coding and I'm just looking for some help to point me in the right direction. Thanks.


Solution

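You grabbed a new url on each pass but never updated your tags variable: tags is built once, before the loop, from the first page, so taglist[pos] is the same link every time. Fetch and parse the current url inside the loop instead:

    import urllib.request
    from bs4 import BeautifulSoup

    def get_html(url):
        # Download the page at url and return it parsed.
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup

    url = input('Enter - ')
    count = int(input('Enter count: '))
    pos = int(input('Enter position: ')) - 1

    urllist = list()

    for i in range(count):
        # Start with an empty list so it only holds links from the current page.
        taglist = list()

        # Re-fetch and re-parse the page at the current url on every pass.
        for tag in get_html(url)('a'):
            taglist.append(tag)

        # Move on to the link at the requested position.
        url = taglist[pos].get('href', None)

        print('Retrieving: ', url)
        urllist.append(url)

    print('Last URL: ', urllist[-1])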
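
If you want to check the loop without hitting the network, one option is to wrap it in a function that takes the page-fetching step as a parameter. The following is only a sketch under that assumption: follow_links, fetch_html, and the FAKE_PAGES dictionary are hypothetical names made up for illustration, and the tiny pages stand in for the course site.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw HTML (needs network access)."""
    return urlopen(url).read()


def follow_links(start_url, count, position, fetch=fetch_html):
    """Follow the link at 1-based `position` on each page, `count` times.

    Returns every URL visited, starting with `start_url`.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        # Parse the page at the *current* url, so the anchors are fresh each pass.
        soup = BeautifulSoup(fetch(url), 'html.parser')
        anchors = soup('a')
        url = anchors[position - 1].get('href')
        visited.append(url)
    return visited


# Hypothetical in-memory pages (not the real course data).
FAKE_PAGES = {
    'a.html': '<a href="x.html">X</a><a href="b.html">B</a>',
    'b.html': '<a href="x.html">X</a><a href="c.html">C</a>',
    'c.html': '<a href="x.html">X</a><a href="d.html">D</a>',
}

# Use a dictionary lookup in place of a real download.
chain = follow_links('a.html', 3, 2, fetch=FAKE_PAGES.__getitem__)
print(chain)  # ['a.html', 'b.html', 'c.html', 'd.html']
```

With the default fetch, follow_links(url, 4, 3) would walk the real pages the same way, and the last element of the returned list is the final URL.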