
Scrape href not working with python


I have copied this code line by line from working examples, and every time it still doesn't work. I am more than frustrated and can't figure out where it is going wrong. What I am trying to do is go to a website and scrape the different ratings pages, which are labelled A, B, C, and so on. Then I visit each of those pages to pull the total number of pages it uses. I am trying to scrape the hrefs ('/ratings/A/1' and so on) under <span class='letter-pages'>. What am I doing wrong?

import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9]+good_ratings)
# elif good_ratings.startswith('/401k'):
#     ks.append(url[:-9]+good_ratings)
del ratings[0]
del ratings[27:]
print(ratings)

for each_rating in ratings:
    page  = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        #Not working Here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the above comment.
        print(href)

Solution

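  • You are trying to get the href too early. soup.find returns the single span tag, and that span has no href of its own; the hrefs live on the a tags nested inside it. Looping over the span directly walks its children (which can include plain text nodes) rather than a list of a tags, so look up the a tags explicitly with find_all: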

    for each_rating in ratings:
        page  = requests.get(each_rating)
        soup = BeautifulSoup(page.text, 'html.parser')
        span = soup.find('span', class_='letter-pages')
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
    

    I didn't test this on all pages, but it worked for the first one. You pointed out that some pages weren't getting scraped; that happens when the span search returns None. To get around this, you can do something like:

    for each_rating in ratings:
        page  = requests.get(each_rating)
        soup = BeautifulSoup(page.text, 'html.parser')
        span = soup.find('span', class_='letter-pages')
        if span:
            for a in span.find_all('a'):
                href = a.get('href')
                pages_scrape.append(href)
                print(href)
        else:
            # each_rating is the URL string; page is a Response and cannot be concatenated to a str
            print('span.letter-pages not found on ' + each_rating)
    

    Depending on your use case you might want to do something different, but this will show you which pages don't match your scraping model and need to be investigated manually.
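
To check the behaviour without hitting the site, here is a minimal offline sketch. The markup in SAMPLE_HTML is invented for illustration and is only assumed to resemble the real page; page_links is a hypothetical helper name, not something from the original code. It shows what looping over the span actually yields, and wraps the find_all lookup with the None guard:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<span class="letter-pages">
  <a href="/ratings/A/1">1</a>
  <a href="/ratings/A/2">2</a>
</span>
"""


def page_links(html):
    """Return the href of every <a> inside span.letter-pages, or [] if the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_="letter-pages")
    if span is None:
        # No pagination span on this page: report nothing instead of crashing.
        return []
    # href=True skips anchors that have no href attribute at all.
    return [a["href"] for a in span.find_all("a", href=True)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
span = soup.find("span", class_="letter-pages")

# Looping over the span walks its direct children. Here that includes the
# whitespace text nodes between the <a> tags as well as the tags themselves.
child_types = [type(child).__name__ for child in span]
print(child_types)

print(page_links(SAMPLE_HTML))
print(page_links("<p>no pagination here</p>"))
```

An empty list from page_links plays the same role as the else branch above: it flags a page that does not match the expected structure.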