Goal: I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue: I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
import requests
def getUrl(url):
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
res = requests.get(url, headers=headers)
res.raise_for_status()
getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by cloudflare, there's some options to try to bypass it, seems that using cloudscraper works:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper