How to write a DownloadHandler for scrapy that makes requests through socksipy?


I'm trying to use scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for scrapy that uses socksipy connections.

Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py

Here is an example for creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py

Here's the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/

import httplib
import socks  # provided by SocksiPy

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport = None, rdns = True, username = None, password = None, *args, **kwargs):
        # Keep the proxy settings so connect() can apply them to the socket.
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Replace the plain socket with one that connects through the proxy.
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

Given the complexity of the Twisted reactor code inside Scrapy, I can't figure out how to plug socksipy into it. Any thoughts?

Please do not answer with privoxy-like alternatives or post answers saying "scrapy doesn't work with socks proxies" - I know that, which is why I'm trying to write a custom Downloader that makes requests using socksipy.


Solution

  • I was able to make this work with https://github.com/habnabit/txsocksx.

    After doing a pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

    I subclassed HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http11.py, copied the bodies of the methods I needed to change, and ended up with this code:

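    # Import paths below match the Scrapy layout linked in the question;
    # they may differ in other Scrapy versions.
    from twisted.internet import reactor
    from twisted.internet.endpoints import TCP4ClientEndpoint
    from txsocksx.http import SOCKS5Agent
    from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
    from scrapy.core.downloader.webclient import _parse


    class TorProxyDownloadHandler(HTTP11DownloadHandler):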
    
        def download_request(self, request, spider):
            """Return a deferred for the HTTP download"""
            agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
            return agent.download_request(request)
    
    
    class ScrapyTorAgent(ScrapyAgent):
        def _get_agent(self, request, timeout):
            bindaddress = request.meta.get('bindaddress') or self._bindAddress
            proxy = request.meta.get('proxy')
            if proxy:
                _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
                scheme = _parse(request.url)[0]
                omitConnectTunnel = proxyParams.find('noconnect') >= 0
                if scheme == 'https' and not omitConnectTunnel:
                    # Unchanged from ScrapyAgent: HTTPS still goes through an
                    # HTTP CONNECT tunnel, not through SOCKS.
                    proxyConf = (proxyHost, proxyPort,
                                 request.headers.get('Proxy-Authorization', None))
                    return self._TunnelingAgent(reactor, proxyConf,
                        contextFactory=self._contextFactory, connectTimeout=timeout,
                        bindAddress=bindaddress, pool=self._pool)
                else:
                    # The only change: connect to the proxy endpoint and speak
                    # SOCKS5 to it via txsocksx.
                    _, _, host, port, proxyParams = _parse(request.url)
                    proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                        timeout=timeout, bindAddress=bindaddress)
                    agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                    return agent
    
            return self._Agent(reactor, contextFactory=self._contextFactory,
                connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
    

    In settings.py, something like this is needed:

    DOWNLOAD_HANDLERS = {
        'http': 'crawler.http.TorProxyDownloadHandler'
    }
    

    With this in place, requests that Scrapy sends through a proxy will go through a SOCKS proxy such as Tor.
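
Note that the handler above only builds a SOCKS5Agent when request.meta['proxy'] is set; otherwise it falls back to the regular agent. One way to set it for every request is a small downloader middleware. The following is a minimal sketch, not part of the original answer: the class name TorProxyMiddleware and the proxy URL are assumptions (127.0.0.1:9050 is Tor's default SOCKS listener, and the URL scheme is only there so that host and port can be parsed out of it).

```python
class TorProxyMiddleware(object):
    """Hypothetical downloader middleware that tags requests with a proxy URL."""

    # Assumed address: Tor listens for SOCKS connections on 127.0.0.1:9050
    # by default. Adjust if your Tor instance is configured differently.
    proxy_url = 'http://127.0.0.1:9050'

    def process_request(self, request, spider):
        # Keep any proxy the spider already chose for this request.
        request.meta.setdefault('proxy', self.proxy_url)
        # Returning None tells Scrapy to continue handling the request.
        return None
```

To use it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py, with a module path matching wherever you place it in your project. Setting meta={'proxy': ...} on individual requests in the spider should work equally well if only some requests need to go through Tor.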