Tags: python, web-scraping, scrapy, user-agent, http-status-code-403

Python Scrapy - (403) status code is not handled or not allowed


I'm trying to scrape reviews from Tripadvisor, more specifically from this address.

I'm currently unable to scrape any data because every request comes back with a 403 status code. At first I tried the usual command, scrapy crawl reviews, without success. I then ran some tests with scrapy shell 'website address' and got the same 403 status. Any extract() attempt returns an empty list.

I've looked up some guides online, installed scrapy-user-agents, and added the corresponding downloader middlewares to the settings.py file as indicated in the linked page. The scraper now tries to crawl the website with a set of fake user agents, but for each one of them I get the error:

[scrapy_user_agents.user_agent_picker] WARNING: [UnsupportedBrowserType]

or the error:

[scrapy_user_agents.user_agent_picker] WARNING: [UnsupportedDeviceType]

and 0 pages are crawled.
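
For reference, the middleware setup mentioned above usually looks something like the fragment below. This is a sketch based on my recollection of the scrapy-user-agents README, so treat the exact middleware paths and the priority value as assumptions and check them against the package documentation.

```python
# settings.py (fragment): sketch of the scrapy-user-agents setup.
# The paths and the priority 400 are assumptions recalled from the package README.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user-agent middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Enable the random user-agent middleware provided by scrapy-user-agents.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}
```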

Does anyone with experience scraping Tripadvisor have any idea how to solve this problem?


Solution

  • I solved the problem by setting a static user agent in the settings.py file. Scrapy already includes a sample line for this in the generated settings.py, but it is commented out. I just uncommented it:

    USER_AGENT = "reviewscraper (+http://www.yourdomain.com)"
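
A likely reason this helps: when USER_AGENT is not set, Scrapy identifies itself with a default string of the form "Scrapy/VERSION (+https://scrapy.org)", which some sites reject outright. Below is a minimal sketch of the relevant part of settings.py, assuming the project is called reviewscraper. CONTACT_URL is a hypothetical helper name introduced here for illustration, and the URL is the placeholder from the generated template, to be replaced with your own.

```python
# settings.py (fragment): a sketch, not a complete settings file.

BOT_NAME = "reviewscraper"

# Hypothetical helper constant; replace the placeholder with your own site.
CONTACT_URL = "http://www.yourdomain.com"

# Static user agent sent with every request, replacing Scrapy's default one.
USER_AGENT = f"{BOT_NAME} (+{CONTACT_URL})"
```

If a bot-style string like this still gets a 403, a browser-like user agent string may be needed instead. I have not verified which strings Tripadvisor accepts.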