Tags: python, scrapy, scrapy-splash

PYTHON: Scraping Researchgate.net with scrapy returns 'Just a moment' instead of the author's page


For a project, I want to gather coauthorship data from ResearchGate.

I am completely new to web scraping, and Scrapy was recommended to me for this project. I want to start from this URL (https://www.researchgate.net/scientific-contributions/Gregory-Phelan-2126234043), scrape the coauthors listed there, then scrape their coauthors, and so on, until I have built a coauthorship network.
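The crawl described here (coauthors, then their coauthors, and so on) amounts to a breadth-first traversal. A minimal sketch of that idea follows; `build_coauthor_network`, `get_coauthors`, and `max_depth` are hypothetical names introduced for illustration, and `get_coauthors` stands in for whatever code actually fetches and parses an author page.

```python
from collections import deque
from typing import Callable, Iterable


def build_coauthor_network(
    start_url: str,
    get_coauthors: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> dict[str, set[str]]:
    """Breadth-first crawl mapping each visited author URL to its coauthor URLs.

    `get_coauthors` is a placeholder (an assumption, not part of Scrapy):
    it takes an author URL and returns the coauthor URLs found on that page.
    """
    network: dict[str, set[str]] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        coauthors = set(get_coauthors(url))
        network[url] = coauthors

        # Stop expanding once the depth limit is reached, so the crawl ends.
        if depth >= max_depth:
            continue

        for coauthor in sorted(coauthors):
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, depth + 1))

    return network
```

The `seen` set keeps an author from being fetched twice, and the depth limit keeps the network from growing without bound.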

I have been trying to fetch this URL with Scrapy, using e.g. the fetch('url') command and by running scrapy shell 'url' in Windows PowerShell, but this returned the following:

[Screenshot: output after opening scrapy shell]

After some research, I installed Docker and combined Scrapy with Splash. I then retried opening a Scrapy shell with the URL, but this time I ran it through Splash (again in PowerShell).

At first this seemed to work, as the output changed to:

[Screenshot: output after opening the scrapy shell]

However, running response.css('title') to get the title returned:

  • [<Selector query='descendant-or-self::title' data='<title>Just a moment...</title>'>]

Part of the response.text output is also:

  • <span id="challenge-error-text">Enable JavaScript and cookies to continue</span>

So it seems to me that Scrapy is somehow unable to reach the actual page behind this link.
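A quick way to confirm that a response is this interstitial page rather than the author's page is to look for the two markers quoted above. This is only a sketch: `looks_like_challenge_page` is a hypothetical helper, and the markers are taken from the output shown here, so they may differ on other sites or change over time.

```python
# Markers copied from the response shown above; treat them as assumptions
# about what this particular interstitial page contains.
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    'id="challenge-error-text"',
)


def looks_like_challenge_page(html: str) -> bool:
    """Return True if the HTML contains any of the known challenge markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

In the Scrapy shell this could be called as `looks_like_challenge_page(response.text)`.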

I also read about setting a USER_AGENT when starting the shell, so I first tried my own and then several randomly generated ones (using UserAgent()), but this did not change the outcome.

Does anyone have suggestions on how to successfully fetch this link and start scraping?

I use Python 3.11.5 and Scrapy 2.11.0.


Solution

  • The website you are trying to scrape is behind Cloudflare, which is very likely protecting it from bots and scrapers and has detected you as a bot. That is why you are getting a 403 status code and a page asking you to enable JavaScript and cookies in order to pass the "Cloudflare challenge".

    • FlareSolverr is a tool that lets you bypass the Cloudflare challenge.
    • Alternatively, try Selenium, which opens an actual browser. You will need to combine it with BeautifulSoup to scrape the page, and it may be a bit more complex to use than Scrapy.
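
As a rough illustration of the FlareSolverr route, the sketch below sends a page request to FlareSolverr's HTTP API and returns the HTML it got back. It assumes FlareSolverr is already running locally on its default port 8191 (for example from its Docker image); the endpoint constant, function names, and timeout values are illustrative choices, and there is no guarantee that every challenge gets solved.

```python
import json
import urllib.request

# Assumption: FlareSolverr is running locally on its default port.
FLARESOLVERR_ENDPOINT = "http://localhost:8191/v1"


def build_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    """Build the JSON body for a FlareSolverr 'request.get' command."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_html(reply: dict) -> str:
    """Return the page HTML from a FlareSolverr reply, or raise on failure."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr failed: {reply.get('message')}")
    return reply["solution"]["response"]


def fetch_via_flaresolverr(url: str, endpoint: str = FLARESOLVERR_ENDPOINT) -> str:
    """POST the request to FlareSolverr and return the resulting HTML."""
    data = json.dumps(build_payload(url)).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    # The HTTP timeout is set above maxTimeout so FlareSolverr can answer first.
    with urllib.request.urlopen(request, timeout=90) as resp:
        reply = json.load(resp)
    return extract_html(reply)
```

The returned HTML can then be parsed with Scrapy's own selectors, e.g. `Selector(text=html).css('title::text').get()` after `from scrapy.selector import Selector`, so the existing CSS queries keep working.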