python, web-scraping, beautifulsoup, python-3.6

Incorrect img alt value being outputted (Python3, Beautiful Soup 4)


I have been working on a restaurant food hygiene scraper. I have been able to get the scraper to scrape the name, address and hygiene rating for restaurants based on postcode. As the food hygiene rating is displayed online as an image, I have set up the scraper to read the "alt" attribute of the img tag, which contains a numeric value for the food hygiene score.

The div containing the img alt attribute that I target for the food hygiene ratings is shown below:

<div class="rating-image" style="clear: right;">
            <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
                <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
            </a>
        </div>
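For reference, this is how the alt value can be pulled out of that markup (a minimal sketch using BeautifulSoup's built-in html.parser, with the snippet above inlined for the demo):

```python
from bs4 import BeautifulSoup

# The rating div exactly as it appears on the page
html = '''
<div class="rating-image" style="clear: right;">
    <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
        <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
    </a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
img = soup.select_one('div.rating-image img[alt]')
alt_text = img['alt']             # the full alt string, e.g. "5 (Very Good)"
score = int(alt_text.split()[0])  # the numeric part of the rating
print(alt_text, score)
```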

I have been able to get the food hygiene score to output beside each restaurant.

My problem, though, is that I noticed some of the restaurants have an incorrect rating displayed beside them, e.g. a 3 instead of a 4 for the food hygiene rating (this value is stored in the img alt attribute).

The URL the scraper initially connects to is:

https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=BT367NG&distance=1&search.x=16&search.y=21&gbt_id=0

I think it might have something to do with the position of the ratings for loop inside the "for item in g_data for loop".

I have discovered that if I move the

appendhygiene(scrape=[name,address,bleh])

piece of code outside the loop below

for rating in ratings:
                bleh = rating['alt']

then the data is scraped with the correct hygiene scores. The only issue is that not all the records are scraped; it only outputs the first 9 restaurants in this case.

I appreciate anyone that can look at my code below and provide help to solve the issue.

P.S. I used postcode BT367NG to scrape restaurants (if you test the script, you can use this postcode to see restaurants that don't display the correct hygiene values, e.g. Lins Garden is a 4 on the site, but the scraped data displays a 3).

My full code is below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []

def deletelist():
    hygiene.clear()


def savefile():
    filename = input("Please input name of file to be saved")        
    with open(filename + '.csv', 'w', newline='') as file:  # newline='' avoids blank rows on Windows
        writer = csv.writer(file)
        writer.writerow(['Name', 'Address', 'Rating'])  # header matches the scraped fields
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")


def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page=requests.get(url)
    print(url + "  scraped successfully")
    return BeautifulSoup(page.text,"lxml")


def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = (item.find_all("a", {"class": "name"})[0].text)
        except:
            pass
        try:
            address = (item.find_all("span", {"class": "address"})[0].text)
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass

        appendhygiene(scrape=[name,address,bleh])


def hygieneratings():

    search = input("Please enter postcode")
    soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))

    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)  # delay between requests so we don't get kicked by the server
        soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))

        button_next = soup.find("a", {"rel" : "next"}, href=True)


def menu():
        strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n' )
        choice = input(strs)
        return int(choice) 

while True:          #use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break

Solution

  • Looks like your problem is here:

    try:
        for rating in ratings:
            bleh = rating['alt']
    
    except:
        pass
    
    appendhygiene(scrape=[name,address,bleh])
    

    What this ends up doing is appending the last alt value on each page to every row: the inner loop runs to completion before appendhygiene is called, so bleh always holds the final rating on the page. That's why, if the last value is "exempt," all values will be "exempt." If the last rating is a 3, all values on that page will be 3. And so on.
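    The effect can be reproduced with plain lists (the restaurant names and alt strings here are made-up stand-ins for the page data):

    ```python
    # Stand-in data: one alt string per rating image on a results page
    ratings = ["5 (Very Good)", "3 (Generally Satisfactory)", "4 (Good)"]
    names = ["Cafe A", "Cafe B", "Cafe C"]

    hygiene = []
    for name in names:
        # mirrors the question's inner loop: bleh is overwritten on
        # every pass, so only the final alt value survives the loop
        for alt in ratings:
            bleh = alt
        hygiene.append([name, bleh])

    print(hygiene)
    # every row ends up with the last rating, "4 (Good)"
    ```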

    What you want is to write something like this:

    try:
        bleh = item.find_all('img', {'alt': True})[0]['alt']
        appendhygiene(scrape=[name,address,bleh])
    
    except:
        pass
    

    So that each item's own rating is appended, rather than simply the last one on the page. I just tested it and it seemed to work :)
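
    Put together on a two-entry snippet of the results markup (class names from the question; the restaurant details are invented for the demo), the per-item lookup gives each row its own rating:

    ```python
    from bs4 import BeautifulSoup

    # Two made-up search results in the page's markup shape
    html = '''
    <div class="search-result">
      <a class="name">Lins Garden</a>
      <span class="address">1 Example Road</span>
      <div class="rating-image"><img alt="4 (Good)"></div>
    </div>
    <div class="search-result">
      <a class="name">Cafe B</a>
      <span class="address">2 Example Road</span>
      <div class="rating-image"><img alt="3 (Generally Satisfactory)"></div>
    </div>
    '''

    soup = BeautifulSoup(html, "html.parser")
    hygiene = []
    for item in soup.find_all("div", {"class": "search-result"}):
        try:
            name = item.find("a", {"class": "name"}).text
            address = item.find("span", {"class": "address"}).text
            # look the rating up inside *this* item, not page-wide
            bleh = item.find_all('img', {'alt': True})[0]['alt']
            hygiene.append([name, address, bleh])
        except (AttributeError, IndexError):
            # skip results missing a field, rather than a bare except
            continue

    print(hygiene)
    ```

    Scoping the img lookup to `item` is what prevents the cross-contamination: each search-result div only ever sees its own rating image.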