Tags: python, http, web-scraping, xpath, scrapy

Scrapy XPath - @href returning unexpected value


I'm currently web-scraping restaurant reviews from Tripadvisor and I'm trying to retrieve restaurant links from this page.

I want the links to the 30 restaurant pages in the lower part of the page, but for now I'm testing with just one of them. The first one in the list can be retrieved with this expression:

//div[@data-test='1_list_item']/div/div[2]/div[1]/div//a/@href

Scrapy is behaving unexpectedly. The following CSS expression should be enough to retrieve all the links, but it returns an empty list instead:

response.css('.b::attr(href)').extract()

The same happens with many XPath expressions. When I use the one above like this:

response.xpath("//div[@data-test='1_list_item']/div/div[2]/div[1]/div//a/@href").get()

I get the following link in return:

/ShowUserReviews-g187791-d25107357-r916086825-ADESSO_Vineria_Bistrot-Rome_Lazio.html

I don't know where this comes from. The link I see in the Chrome inspector, and the one I expected to get back, is:

/Restaurant_Review-g187791-d25107357-Reviews-ADESSO_Vineria_Bistrot-Rome_Lazio.html
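
One possible cause of a mismatch like this is that the HTML Scrapy downloads differs from the DOM that Chrome's inspector shows after JavaScript has run, although nothing in this post confirms that this is what happens here. As a minimal sketch for checking, the snippet below uses only the standard library to list every `href` in a raw HTML string and count the links by their leading path segment. The `HrefCollector` class, the `count_by_prefix` helper, and the sample markup are hypothetical; in a spider, `response.text` would be passed in place of the sample string.

```python
from collections import Counter
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_by_prefix(html):
    """Count links by the part of the path before the first '-'."""
    parser = HrefCollector()
    parser.feed(html)
    return Counter(href.split("-", 1)[0] for href in parser.hrefs)


# Hypothetical markup standing in for response.text; the real page differs.
sample_html = """
<div data-test="1_list_item">
  <a href="/ShowUserReviews-g187791-d25107357-r916086825-ADESSO_Vineria_Bistrot-Rome_Lazio.html">review</a>
  <span><a href="/Restaurant_Review-g187791-d25107357-Reviews-ADESSO_Vineria_Bistrot-Rome_Lazio.html">name</a></span>
</div>
"""

prefix_counts = count_by_prefix(sample_html)
print(prefix_counts)
```

If both `/ShowUserReviews` and `/Restaurant_Review` links show up inside the same list item, a purely positional XPath can easily land on the wrong one.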


Solution

  • I solved my problem using the XPath expression suggested by SIM:

    //div[contains(@data-test,'_list_item')]//div/div/div/span/a[starts-with(@href,'/Restaurant_Review')]/@href
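
The `starts-with` filter can also be applied in Python after extraction, which makes it easy to drop duplicates and build absolute URLs at the same time. The following is a minimal sketch under stated assumptions: `restaurant_links` is a hypothetical helper, the base URL `https://www.tripadvisor.com` is assumed, and the input list stands in for what a broad `@href` query with `.getall()` might return.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com"  # assumed base; adjust to the site actually scraped


def restaurant_links(hrefs, base_url=BASE_URL, prefix="/Restaurant_Review"):
    """Keep hrefs that start with prefix, drop duplicates, return absolute URLs."""
    seen = set()
    links = []
    for href in hrefs:
        if href.startswith(prefix) and href not in seen:
            seen.add(href)
            links.append(urljoin(base_url, href))
    return links


# Stand-in for the result of .getall() on a broad @href query (hypothetical data).
hrefs = [
    "/ShowUserReviews-g187791-d25107357-r916086825-ADESSO_Vineria_Bistrot-Rome_Lazio.html",
    "/Restaurant_Review-g187791-d25107357-Reviews-ADESSO_Vineria_Bistrot-Rome_Lazio.html",
    "/Restaurant_Review-g187791-d25107357-Reviews-ADESSO_Vineria_Bistrot-Rome_Lazio.html",
]

links = restaurant_links(hrefs)
print(links)
```

Inside a spider, `response.urljoin(href)` serves the same purpose as `urljoin` here, resolving the relative link against the URL of the response.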