python http web-scraping xpath scrapy

Scrapy - xpath returns empty list


I'm scraping restaurant reviews from Yelp, specifically from this URL

I'm trying to get the list of review containers. After testing in the Chrome console, the following XPath expression selects them:

//li/div[@class='css-1qn0b6x']

However, when I test it in the Scrapy shell, the following command returns an empty list:

response.xpath("//li/div[@class='css-1qn0b6x']").extract()


Solution

  • Continuing from the comments above, here is an example of how you can gather reviews by first getting the yelp-biz-id from the HTML of the original page you linked:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    s = requests.Session()
    
    url = 'https://www.yelp.it/biz/roscioli-roma-4'
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
    
    resp = s.get(url,headers=headers)
    soup = BeautifulSoup(resp.text,'html.parser')
    biz_id = soup.find('meta',{'name':'yelp-biz-id'})['content']
    
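    frames = []  # one DataFrame per page of reviews
    for page in range(5):
    
        # The review feed is paginated through the `start` offset, 10 reviews per page
        api_url = f'https://www.yelp.it/biz/{biz_id}/review_feed?start={page*10}'
        resp = s.get(api_url, headers=headers)
    
        # Check the status before parsing: an error page may not be valid JSON
        if resp.status_code != 200:
            continue
    
        # Use a separate name so the accumulator list is not overwritten
        page_reviews = resp.json().get('reviews', [])
    
        if len(page_reviews) > 0:
            frames.append(pd.json_normalize(page_reviews))
    
    final_df = pd.concat(frames).reset_index()
    final_df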
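
Since the question uses Scrapy rather than requests, the two pure steps of the loop above (building the paginated URLs and pulling the review list out of each JSON body) can be split out so they work with either library. This is only a minimal sketch: the function names `review_feed_urls` and `extract_reviews` are hypothetical, and it assumes the `review_feed` endpoint still takes a `start` offset in steps of 10 and returns a JSON object with a `reviews` list, as in the answer.

```python
import json


def review_feed_urls(biz_id, pages=5, page_size=10, base="https://www.yelp.it"):
    """Build one review_feed URL per page (hypothetical helper).

    Assumes the endpoint paginates through a `start` offset, as in the answer.
    """
    return [
        f"{base}/biz/{biz_id}/review_feed?start={page * page_size}"
        for page in range(pages)
    ]


def extract_reviews(body):
    """Return the list under 'reviews' in a JSON body (hypothetical helper).

    Returns an empty list if the body is not valid JSON, is not a JSON
    object, or has no 'reviews' list.
    """
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return []
    if not isinstance(data, dict):
        return []
    reviews = data.get("reviews", [])
    return reviews if isinstance(reviews, list) else []
```

In a Scrapy spider, each URL could be yielded as a `scrapy.Request`, with `extract_reviews(response.text)` called in the callback. With requests, the same function can be applied to `resp.text`.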