python, web-scraping, scrapy, user-agent

Blocked from scraping a website with Scrapy?


I'm still trying to scrape search results from this kind of URL, which is the search results page of a Chinese online newspaper. Scrapy works for a few requests, and then I get the following terminal output:

2019-12-19 11:56:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <461 http://so.news.cn/getNews?keyword=%E7%BE%8E%E5%9B%BD&curPage=55&sortField=0&searchFields=0&lang=cn>: HTTP status code is not handled or not allowed

It seems to work better if I add a delay, but then it is very slow. Is this because I am being blocked by the site, and is there anything I can do about it? I don't currently have any special User-Agent defined in settings.py. I have tried using scrapy-UserAgent to rotate the User-Agent, but it doesn't seem to be working. Would a VPN help?

Thanks


Solution

  • Different solutions to test:

    • Add a random pause between requests
    • Make good use of sessions:

      1) Keep the same session for a number of requests (30 to 60)

      2) Clear your cookies after 30 to 60 requests and change the user agent. This simple Python package can help with that: https://pypi.org/project/shadow-useragent/

      3) If that still does not work: rotate your IP over time (every 30 to 60 requests, for instance) using a proxy provider, and rotate your user agent and clear your cookies at the same time.

    You should now look random to most websites. If you run into further bot mitigation (reCAPTCHA) or specialized anti-scraping services, this could get trickier.
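
    For reference, here is a minimal sketch of how the steps above could be wired into Scrapy. It is not a tested fix for this particular site. The class name RotatingSessionMiddleware, the module path myproject.middlewares, the user-agent strings and the proxy URLs are all placeholders I made up; the settings names and the "cookiejar" / "proxy" request meta keys are standard Scrapy ones. Switching to a fresh cookiejar key is used here as a stand-in for clearing cookies.

```python
import random

# Placeholder pool (assumption): swap in current browser strings,
# e.g. taken from a package such as shadow-useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
]


class RotatingSessionMiddleware:
    """Downloader middleware sketch: new identity every 30 to 60 requests."""

    def __init__(self, user_agents=USER_AGENTS, proxies=None,
                 min_requests=30, max_requests=60):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies or [])  # e.g. URLs from a proxy provider
        self.min_requests = min_requests
        self.max_requests = max_requests
        self.session_id = 0
        self._start_session()

    def _start_session(self):
        # A new cookiejar key gives Scrapy's CookiesMiddleware an empty jar,
        # which has the same effect as clearing the cookies.
        self.session_id += 1
        self.remaining = random.randint(self.min_requests, self.max_requests)
        self.user_agent = random.choice(self.user_agents)
        self.proxy = random.choice(self.proxies) if self.proxies else None

    def process_request(self, request, spider):
        if self.remaining <= 0:
            self._start_session()
        self.remaining -= 1
        request.headers["User-Agent"] = self.user_agent
        request.meta["cookiejar"] = self.session_id
        if self.proxy:
            request.meta["proxy"] = self.proxy
        return None  # let Scrapy continue processing the request


# settings.py fragment (values are guesses to tune, not recommendations)
DOWNLOAD_DELAY = 2                  # base pause in seconds
RANDOMIZE_DOWNLOAD_DELAY = True     # actual wait is 0.5x to 1.5x the base
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Assumption: 461 is this site's "blocked" response, so retry it
# alongside Scrapy's default retry codes.
RETRY_HTTP_CODES = [461, 500, 502, 503, 504, 522, 524, 408, 429]
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; 450 runs before the built-in
    # user-agent (500), cookies (700) and proxy (750) middlewares.
    "myproject.middlewares.RotatingSessionMiddleware": 450,
}
```

    Scrapy builds the middleware without arguments, so to use proxies you would either hard-code a list as the default or add a from_crawler classmethod that reads one from your settings.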