python python-3.x web-scraping beautifulsoup data-cleaning

Basic Python BeautifulSoup web-scraping Tripadvisor reviews and data cleaning


I'm a complete beginner at programming and at StackOverflow, and I just need to do some basic web scraping of a TripAdvisor page, extract some useful information from it and display it nicely. I'm trying to isolate the name of each cafe, its rating and its number of reviews. I'm thinking I might need to convert the page to text and use regex or something? I really don't know. An example of what I mean would be:

Output:

Coffee Cafe, 4 out of 5 bubbles, 201 reviews.

Something like that. My code so far is below; any help would be amazing and I would be infinitely grateful. Cheers.

import urllib.request
from bs4 import BeautifulSoup

def get_HTML(url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html


Tripadvisor_reviews_HTML = get_HTML(
    'https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html')


def get_review_count(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for element in soup(attrs={'class' : 'reviewCount'}):
        print(element)

get_review_count(Tripadvisor_reviews_HTML)

def get_review_score(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for four_point_five_score in soup(attrs={'alt' : '4.5 of 5 bubbles'}):
        print(four_point_five_score)


get_review_score(Tripadvisor_reviews_HTML)

def get_cafe_name(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for name in soup(attrs={'class' : "property_title"}):
        print(name)



get_cafe_name(Tripadvisor_reviews_HTML)

Solution

  • You forgot to use .text in each print statement, so the whole tag is printed rather than just its text. Try the approach below to get all three fields from that page.

    from bs4 import BeautifulSoup
    import urllib.request

    URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

    def get_info(link):
        # Download the page and parse it
        response = urllib.request.urlopen(link)
        soup = BeautifulSoup(response.read(), "lxml")
        # Each listing is read from its own "shortSellDetails" container
        for items in soup.find_all(class_="shortSellDetails"):
            name = items.find(class_="property_title").get_text(strip=True)
            # The rating text is stored in the element's "alt" attribute
            bubble = items.find(class_="ui_bubble_rating").get("alt")
            review = items.find(class_="reviewCount").get_text(strip=True)
            print(name, bubble, review)

    if __name__ == '__main__':
        get_info(URL)
    

    The output should look something like:

    Double Shot New Farm 4.5 of 5 bubbles 218 reviews
    Goodness Gracious Cafe 4.5 of 5 bubbles 150 reviews
    New Farm Deli & Cafe 4.5 of 5 bubbles 273 reviews
    Coffee Anthology 4.5 of 5 bubbles 116 reviews
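If you also want the cleaned-up format from the question ("Coffee Cafe, 4 out of 5 bubbles, 201 reviews."), one possible sketch is below. It is only an illustration: `SAMPLE_HTML` is made-up markup that mimics the class names used above (the real page may be structured differently), `parse_listings` and `format_listing` are hypothetical helper names, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies. A small regex pulls the numbers out of the rating and review-count strings.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names used above;
# the real TripAdvisor page may look different.
SAMPLE_HTML = """
<div class="shortSellDetails">
  <a class="property_title"> Double Shot New Farm </a>
  <span class="ui_bubble_rating" alt="4.5 of 5 bubbles"></span>
  <span class="reviewCount"><a>218 reviews</a></span>
</div>
<div class="shortSellDetails">
  <a class="property_title">Coffee Anthology</a>
  <span class="ui_bubble_rating" alt="4.5 of 5 bubbles"></span>
  <span class="reviewCount"><a>1,116 reviews</a></span>
</div>
"""


def parse_listings(html):
    """Return a list of dicts with a cleaned name, rating and review count."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all(class_="shortSellDetails"):
        name_tag = item.find(class_="property_title")
        rating_tag = item.find(class_="ui_bubble_rating")
        count_tag = item.find(class_="reviewCount")
        # Skip listings that are missing any of the three fields
        if name_tag is None or rating_tag is None or count_tag is None:
            continue
        # "4.5 of 5 bubbles" -> "4.5" (the first number in the string)
        rating_match = re.search(r"\d+(?:\.\d+)?", rating_tag.get("alt", ""))
        # "1,116 reviews" -> "1,116"
        count_match = re.search(r"\d[\d,]*", count_tag.get_text())
        if rating_match is None or count_match is None:
            continue
        rows.append({
            "name": name_tag.get_text(strip=True),
            "rating": float(rating_match.group()),
            "reviews": int(count_match.group().replace(",", "")),
        })
    return rows


def format_listing(row):
    """Format one row in the style asked for in the question."""
    # The "g" format drops a trailing ".0", so 4.0 is shown as "4"
    return f"{row['name']}, {row['rating']:g} out of 5 bubbles, {row['reviews']} reviews."


if __name__ == "__main__":
    for row in parse_listings(SAMPLE_HTML):
        print(format_listing(row))
```

Keeping the rating and review count as numbers (rather than strings) also makes it easy to sort or filter the cafes later, for example by rating.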