
Python3 requests-html: Unfortunately, automated access to this page was denied


Hello StackOverflow community,

A few months ago, I created a scraper with Python 3 and requests-html together with BeautifulSoup in order to scrape car ads from https://www.mobile.de. The scraper uses the following search URL to fetch a list of all available car ads and then iterates through the detail pages.

Please find below the code:

from bs4 import BeautifulSoup, SoupStrainer
from requests_html import HTMLSession
import re

url = 'https://suchen.mobile.de/fahrzeuge/search.html?&damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&makeModelVariant1.makeId=25200&makeModelVariant1.modelId=g29&scopeId=C&sfmr=false'

session = HTMLSession()
r = session.get(url)
# Only parse <a> tags to keep parsing fast, then print every link
# that points to a car ad detail page.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(r.content, 'lxml', parse_only=only_a_tags)
for link in soup.find_all('a', attrs={'href': re.compile("^https://suchen.mobile.de/fahrzeuge/details.html")}):
    print(link.get("href"))

For a few days now, the scraper has not been able to fetch any car ads from the website. When iterating through all <a> tags in order to fetch the detail pages of the car ads (which always start with https://suchen.mobile.de/fahrzeuge/details.html), no results are returned, whereas links to the detail pages used to be printed. When printing the HTML content, I only receive the following error message:

b'<!DOCTYPE html>\n<html>\n  <!--\nLeider koennen wir Dir an dieser Stelle keinen Zugriff auf unsere Daten gewaehren.\nSolltest Du weiterhin Interesse an einem Bezug unserer Daten haben, wende Dich bitte an:\n\nUnfortunately, automated access to this page was denied.\nIf you are interested in accessing our data, please contact us:\n\nPhone:\n+49 (0) 30 8109-7573\n\nMail:\nDatenpartner@team.mobile.de\n  -->\n  <head>\n    <meta charset="UTF-8">\n\n    <title>Ups, bist Du ein Mensch? / Are you a human?</title>\n        <link rel="stylesheet" href="https://static.classistatic.de/shared/mde-style/2.1.0/style.css">\n    <link rel="icon" type="image/x-icon" href="data:image/x-icon;base64,AAABAAEAEBAAAAAAAABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbJ7yGTV+7Msdbuv/hKzr6/347tLw8O/S8vLy0vPz89Ly8vLS8PDw0vDw79Lt7e3S6urqnNDQ0AQAAAAAXpbwGih27MEfcOv/DWbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f3q6em05OTkGTR97c4fcOv/H3Dr/w5m6/+Ir+z///rw//Ly8v/19fX/9vb2//X19f/z8/P/8vLy/+/v7//t7e3/6urq/+fn5s8hcev/H3Dr/x9w6/8OZuv/iK/s///67/////////////X19f////////////Hx8f/6+vr//v7+/+vr6//n5+f/H3Dr/x9w6/8fcOv/Dmbr/4iw7f///fX/nJ2d/5mamf/6+vr/pqen/4+Qj//4+Pj/ra6u/4WGhv/m5ub/6urq/x9w6/8fcOv/H3Dr/w5m6/+IsO3////8/y0vL/8iJCT//////0FCQv8OEBD//////1laWv8AAgL/4eHh/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/88Pj7/MjQ0//////9PUFD/HyEh//////9kZWX/EhQU/+Li4v/t7e3/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////v/PD4+/zI0NP//////T1BQ/x8hIf//////ZGVl/xIUFP/i4uL/7e3t/x9w6/8fcOv/H3Dr/w5m6/+IsO3////7/zw+Pv80Njb//////1JTU/8gIiL//////2hpaf8RExP/4uLi/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/8/QED/ERMT/8XGxv8/QUH/BggI/7+/v/9KTEz/ExUV/+Tk5P/t7Oz/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////r/SktL/y0vL/8oKir/AgMD/2FiYv82Nzf/AAAA/2BhYf/19fT/6Ojo/x9w6/8fcOv/H3Dr/w5m6/+Ir+z///rx/+jo6P/y8vL/7O3t/87Ozv/8/Pz/7e3t/8nJyf/v7u7/7Ozs/+fn5/8fcOv/H3Dr/x9w6/8OZuv/iK/s///68P/19fX/9/f3//n5+f/9/f3/9PT0//X19f/39/f/7u7u/+rq6v/n5+f/MXvs4x5w6/8fcOv/Dmbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f/q6ur/5+fmwU2M7yUoduzGH3Dr/w5m6/+Ir+z///rw//Ly8v/19fX/9vb2//X19f/z8/P/8vLy/+/v7//t7e3/6urpu+Pj4hMAAAAAc6PyGDN87McRaOv/hq7r//347f/v7+//8fHx//Ly8v/x8fH/8PDw/+/v7//t7e386+vqtuLi4RAAAAAAwAMAAIABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAQAAwAMAAA==">\n    <script src=\'https://www.google.com/recaptcha/api.js\'></script>\n  </head>\n  <body>\n    <header id="mdeHeader" class="header">\n      <div class="header-meta-container header-hidden-small">\n        <!-- placeholder for desktop meta -->\n      </div>\n      <div class="header-navbar clearfix">\n        <div class="header-corporate">\n          <a href="//www.mobile.de"><i class="gicon-mobilede-logo"></i></a>\n          <span class="claim header-hidden-small">Deutschlands gr\xc3\xb6\xc3\x9fter Fahrzeugmarkt</span>\n        </div>\n      </div>\n    </header>\n  <div class="g-container">\n    <h2 class="u-pad-bottom-18 u-margin-top-18">Ups, bist Du ein Mensch? 
/ Are you a human?</h2>\n\n\n    <div id="root"></div>\n    <div class="cBox cBox--content">\n      <p><b>\n        Um fortzufahren muss dein Browser Cookies unterst\xc3\xbctzen und JavaScript aktiviert sein.<br>\n        To continue your browser has to accept cookies and has to have JavaScript enabled.</b>\n      </p>\n\n      <p>\n        Bei Problemen wende Dich bitte an:<br>\n        In case of problems please contact:\n      </p>\n      <p>\n        Phone: 030 81097-601<br>\n        Mail: service@team.mobile.de\n      </p>\n\n      <p>\n        Sollte grunds\xc3\xa4tzliches Interesse am Bezug von mobile.de Daten bestehen, wende Dich bitte an:<br/>\n        If you are primarily interested in purchasing data from mobile.de, please contact:\n      </p>\n      <p>\n        Mail: Datenpartner@team.mobile.de\n      </p>\n    </div>\n    <hr class="u-pad-top-9 u-pad-bottom-18"/>\n    <div id="footer"></div>\n    <script async src="https://www.mobile.de/api/consent/static/js/consentBanner.js"></script>\n  <script type="text/javascript" src="https://www.mobile.de/youre-blocked/app.js"></script><script type="text/javascript" >var _cf = _cf || []; _cf.push([\'_setFsp\', true]);  _cf.push([\'_setBm\', true]);  _cf.push([\'_setAu\', \'/static/16b9372bb8fti233b6fc758bf7a4291f0\']); </script><script type="text/javascript"  src="/static/16b9372bb8fti233b6fc758bf7a4291f0"></script></body>\n</html>\n'
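For reference, the block page can be detected by looking for the "Are you a human?" marker that appears in the output above; this is just a minimal sketch:

from requests_html import HTMLSession

url = 'https://suchen.mobile.de/fahrzeuge/search.html?&damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&makeModelVariant1.makeId=25200&makeModelVariant1.modelId=g29&scopeId=C&sfmr=false'

session = HTMLSession()
r = session.get(url)

# The bot-detection page shown above carries this title; the real result page does not.
if "Are you a human?" in r.text:
    print("Blocked: mobile.de returned the bot-detection page.")
else:
    print("OK: received the actual search results.")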

When creating the scraper, I also received the "Unfortunately, automated access to this page was denied." message when using urllib, hence I switched over to requests-html and everything worked great.

I have already tried to solve it with the following approaches, but none of them has worked so far :(

  • proxy rotation (I thought my IP address might have been blocked)
  • different user agent in the header via the fake_useragent library (see the sketch below)
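
For context, this is roughly what the user-agent/proxy attempt looked like; the proxy address below is only a placeholder, not a real server:

from fake_useragent import UserAgent
from requests_html import HTMLSession

ua = UserAgent()
session = HTMLSession()

# Placeholder proxy address -- replace with an address from your rotation pool.
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

# Send a random, realistic browser user agent with every request.
headers = {"User-Agent": ua.random}

r = session.get(url, headers=headers, proxies=proxies)  # url = the search URL from above
print(r.status_code)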

I hope you are able to help, as I currently don't know what else I could try.

Thanks a lot in advance for helping me with this issue :)


Solution

  • Use a Selenium WebDriver to first navigate to the search page and then run the query from there.

    I just got the same message on my own machine when running your code. When I visit the site manually, I also see a reCAPTCHA. Even opening it directly with Selenium generates the reCAPTCHA.

    Were I working to defeat you, I would just require the reCAPTCHA whenever a direct connection was made to search results. That would be my guess for how you are being blocked. When I use a WebDriver to first navigate to the search page, I do not get challenged.

    Here is the code that I used.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://suchen.mobile.de/fahrzeuge/search.html")

    # Implicit wait is given in seconds and applies to every find_element
    # call below -- not good practice, but quick and easy.
    driver.implicitly_wait(5)

    driver.find_element(By.ID, "gdpr-consent-accept-button").click()  # accept the cookie banner
    driver.find_element(By.ID, "fuels-PETROL-ds").click()             # tick a search filter
    driver.find_element(By.ID, "dsp-upper-search-btn").click()        # run the search
    

    This is not going to work forever, but it works at least for now.
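
    If it helps, here is a sketch of feeding the rendered result page back into the BeautifulSoup filter from your question. It continues straight from the snippet above and assumes the page that loads after the search click contains the same detail-page links; I have not verified that beyond a quick look.

    # Continuing from the Selenium snippet above -- `driver` is still open and
    # the search button has just been clicked.
    import re
    import time

    from bs4 import BeautifulSoup, SoupStrainer

    time.sleep(5)  # crude pause so the result page can render; an explicit wait would be cleaner

    # Reuse the same <a>-tag filter from the question on the rendered page source.
    only_a_tags = SoupStrainer("a")
    soup = BeautifulSoup(driver.page_source, "lxml", parse_only=only_a_tags)
    for link in soup.find_all('a', attrs={'href': re.compile("^https://suchen.mobile.de/fahrzeuge/details.html")}):
        print(link.get("href"))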