
Running scrapy splash with proxies


I am using a proxy with scrapy-splash, but I keep getting a 502 proxy error. This has been troubling me for several days.

My downloader middleware:

import base64

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class ABProxyMiddleware(HttpProxyMiddleware):
    """Abuyun IP proxy configuration."""
    proxyAuth = "Basic " + base64.urlsafe_b64encode(
        bytes((settings['PROXY_USER'] + ":" + settings['PROXY_PASS']), "ascii")).decode("utf-8")

    def process_request(self, request, spider):
        request.meta['splash']['args']['proxy'] = settings['PROXY_SERVER']
        request.headers['Proxy-Authorization'] = self.proxyAuth

My requests:

yield SplashRequest(
    url='http://www.qidian.com/all?chanId=4&subCateId=130&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(i),
    callback=self.book_parse, endpoint='render.html')

My settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'tempScrapy.middlewares.ABProxyMiddleware': 100,
}

I am sure that all the proxy settings are correct and that the proxy is valid, because the same requests succeed without Splash.


Solution

  • According to your code, you're sending the proxy authentication headers to the Splash server:

    +-------------+
    | Your spider |
    +------+------+
           |
           | Proxy Authentication
           v
    +------+-------+
    |   Splash     |
    +------+-------+
           |
           |
           v
    +------+-------+
    | Proxy server |
    +------+-------+
           |
           |
           v
    +------+-------+
    | Target site  |
    +--------------+
    

    The Splash server simply ignores the proxy authentication header you send and does not pass it on, so the proxy server rejects the request because authentication fails.

    The right thing to do is to have Splash send the proxy authentication header:

    +-------------+
    | Your spider |
    +------+------+
           |
           |
           v
    +------+-------+
    |   Splash     |
    +------+-------+
           |
           | Proxy Authentication
           v
    +------+-------+
    | Proxy server |
    +------+-------+
           |
           |
           v
    +------+-------+
    | Target site  |
    +--------------+
    

    So you'll need to remove this line:

    request.headers['Proxy-Authorization'] = self.proxyAuth
    

    and properly configure the proxy info:

    request.meta['splash']['args']['proxy'] = 'proxy info of format: [protocol://][user:password@]proxyhost[:port]'
    

    See also: API reference of Splash (look for the proxy argument)
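
As a rough illustration of the fix above, here is a minimal sketch of a middleware that puts the credentials into the proxy URL and hands that URL to Splash, instead of setting a Proxy-Authorization header. The names build_proxy_url, SplashProxyMiddleware and the SPLASH_PROXY_URL setting are hypothetical and not part of Scrapy or scrapy-splash; the host, port and credentials are placeholders. It assumes the proxy accepts credentials embedded in the URL, and it does no escaping of special characters in the user name or password.

```python
def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Return a proxy URL of the form [protocol://][user:password@]proxyhost[:port].

    Hypothetical helper: credentials are inserted as-is, without escaping.
    """
    credentials = ""
    if user is not None and password is not None:
        credentials = "{}:{}@".format(user, password)
    return "{}://{}{}:{}".format(scheme, credentials, host, port)


class SplashProxyMiddleware:
    """Hypothetical downloader middleware that passes the proxy to Splash.

    It only fills in the `proxy` argument of the Splash request and sets no
    Proxy-Authorization header, so Splash itself authenticates to the proxy.
    """

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # SPLASH_PROXY_URL is an assumed setting name, used only in this sketch.
        return cls(crawler.settings.get("SPLASH_PROXY_URL"))

    def process_request(self, request, spider):
        splash_meta = request.meta.get("splash")
        if splash_meta is None:
            return None  # not a Splash request; leave it untouched
        splash_meta.setdefault("args", {})["proxy"] = self.proxy_url
        return None
```

As in the question's settings, such a middleware would need a priority number lower than that of scrapy_splash.SplashMiddleware (725), so that it runs while request.meta['splash'] is still in place.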