python web-scraping beautifulsoup python-requests tor

Python Tor: Request could not be satisfied / Request blocked


I'm trying to request the link below through Tor, but it returns an error. The same request without Tor works perfectly fine, but I still need it to go through Tor, or at least through a randomized IP.

Am I doing this right, or is there a better solution?

import requests

link = 'https://www.totallylegal.com/searchjobs/'
torport = 9050  # default SOCKS port of a locally running Tor daemon
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
# socks5h (rather than socks5) makes the proxy resolve hostnames as well
proxies = {
    'http': "socks5h://localhost:{}".format(torport),
    'https': "socks5h://localhost:{}".format(torport)
}

print(requests.get(link, headers=headers, proxies=proxies).content)

Below is the error that shows up:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.

<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: iXaDPfPtyHg0TGTFJvYuAnV86unJIpBITxdBJ2w_i_bo-ToR510p2w==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
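
For scripting purposes, a response like the one above can be told apart from a real page by checking the status code together with the CloudFront markers in the body. This is only a minimal sketch: `looks_like_cloudfront_block` is a hypothetical helper name, and the marker strings are assumptions taken from this particular error page, so other block pages may differ.

```python
def looks_like_cloudfront_block(status_code, body):
    """Return True if a response resembles the CloudFront 403 block page."""
    if status_code != 403:
        return False
    text = body.lower()
    # Marker strings copied from the error page shown in this question
    return "generated by cloudfront" in text and "request blocked" in text


# Shortened copy of the error page, used here instead of a live request
sample = """<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
Request blocked.
Generated by cloudfront (CloudFront)"""

print(looks_like_cloudfront_block(403, sample))
print(looks_like_cloudfront_block(200, "<html>jobs</html>"))
```

With a real response `r` from `requests`, the call would be `looks_like_cloudfront_block(r.status_code, r.text)`.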

Solution

  • The page seems to blocklist Tor IP addresses, so we can get around this by going through another site that shows the page source to us, e.g. the W3C validator: https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Fwww.totallylegal.com%2Fsearchjobs%2F

    We're still using Tor, but we let the other site fetch the page for us (and its IP isn't blocked):

    from bs4 import BeautifulSoup
    import requests
    
    proxies = {
        'http': 'http://<YOUR PROXY ADDRESS>:<YOUR PROXY PORT>',
        'https': 'http://<YOUR PROXY ADDRESS>:<YOUR PROXY PORT>',
    }
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
        'accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    }
    
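    # Ask the validator to fetch the page and include its source (showsource=yes)
    r = requests.get('https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Fwww.totallylegal.com%2Fsearchjobs%2F', proxies=proxies, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Rebuild the original HTML from the validator's source listing;
    # <code> elements with class "lf" stand for line breaks
    source_code = ''
    for code in soup.select('ol.source > li > code'):
        if 'class' in code.attrs and 'lf' in code.attrs['class']:
            source_code += '\n'
        else:
            source_code += code.text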
    
    soup2 = BeautifulSoup(source_code, 'lxml')
    
    for li in soup2.select('li.lister__item h3'):
        print(li.text)
        print('-' * 80)
    

    Prints:

    Corporate Partner
    --------------------------------------------------------------------------------
    Personal Injury Paralegal
    --------------------------------------------------------------------------------
    Healthcare Regulatory Lawyer - London
    --------------------------------------------------------------------------------
    Company Secretary and Corporate Governance
    --------------------------------------------------------------------------------
    Junior FCPA/Compliance Associate, Beijing - 14612/TTL
    --------------------------------------------------------------------------------
    International Project Manager, Shanghai - 14611/TTL
    --------------------------------------------------------------------------------
    Corporate Associate (4+ PQE) Beijing - 14610/TTL
    --------------------------------------------------------------------------------
    Corporate Associate (5+ PQE) Shanghai - 14609/TTL
    --------------------------------------------------------------------------------
    Corporate or Commercial Counsel -Pharma- Surrey
    --------------------------------------------------------------------------------
    Corporate/Public M&A PSL, 5+ PQE
    --------------------------------------------------------------------------------
    Solicitor
    --------------------------------------------------------------------------------
    In-house Legal Counsel - Excellent opportunity to go In-House!
    --------------------------------------------------------------------------------
    Real Estate Partner
    --------------------------------------------------------------------------------
    Child Brain Injury Solicitor
    --------------------------------------------------------------------------------
    Corporate/Commercial In-House Lawyer, 1+
    --------------------------------------------------------------------------------
    In-house Regulatory Counsel, Banking/Payments, 5+
    --------------------------------------------------------------------------------
    In-house Property Finance/Banking Lawyer, 1-3
    --------------------------------------------------------------------------------
    Hybrid Legal & Compliance Data Protection Manager
    --------------------------------------------------------------------------------
    Hedge Fund Legal Counsel 3-5 years PQE
    --------------------------------------------------------------------------------
    Corporate PSL
    --------------------------------------------------------------------------------
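
The validator address used in the snippet above is just the target URL percent-encoded into the `doc` query parameter, with `showsource=yes` added. A minimal sketch of building it for an arbitrary page follows; `build_validator_url` and `VALIDATOR_ENDPOINT` are hypothetical names introduced here for illustration and are not part of the original answer.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in the answer above
VALIDATOR_ENDPOINT = "https://validator.w3.org/nu/"


def build_validator_url(target_url):
    """Return a validator URL that fetches target_url and shows its source."""
    # urlencode percent-encodes ':' and '/' in the target address
    query = urlencode({"showsource": "yes", "doc": target_url})
    return f"{VALIDATOR_ENDPOINT}?{query}"


print(build_validator_url("https://www.totallylegal.com/searchjobs/"))
```

The resulting string can be passed to `requests.get` in place of the hard-coded address, which makes it easier to reuse the same approach for other pages of the site.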