web-scraping docker-compose scrapy scrapy-splash zyte

Requests fail with 504: Gateway Time-out when using scrapy-splash in docker compose with zyte

I'm trying to scrape one site which partially renders content using JS.

I went ahead and found this project: https://github.com/scrapinghub/sample-projects/tree/master/splash_smart_proxy_manager_example, which quite neatly explains how to set things up. Here's what I have right now:

Docker compose:

version: '3.8'

services:
    scraping:
        build:
            context: .
            dockerfile: Dockerfile
        volumes:
            - "./scraping:/scraping"
        environment:
            - PYTHONUNBUFFERED=1
        depends_on:
            - splash
        links:
            - splash
    splash:
        image: scrapinghub/splash
        restart: always
        expose:
            - 5023
            - 8050
            - 8051
        ports:
            - "5023:5023"
            - "8050:8050"
            - "8051:8051"

Spider:

from pkgutil import get_data

import scrapy
from scrapy_splash import SplashRequest


class HappySpider(scrapy.Spider):
    ...
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'ITEM_PIPELINES': {
            'scraping.pipelines.HappySpiderPipeline': 300,
        },
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
        'RETRY_TIMES': 20,
        'DOWNLOAD_DELAY': 5,
        'DOWNLOAD_TIMEOUT': 30,
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'COOKIES_ENABLED': False,
        'ROBOTSTXT_OBEY': True,
        # enable Zyte Proxy
        'ZYTE_SMARTPROXY_ENABLED': True,
        # the APIkey you get with your subscription
        'ZYTE_SMARTPROXY_APIKEY': '<my key>',
        'SPLASH_URL': 'http://splash:8050/',
    }

    def __init__(self, testing=False, name=None, **kwargs):
        self.LUA_SOURCE = get_data(
            'scraping', 'scripts/smart_proxy_manager.lua'
        ).decode('utf-8')
        super().__init__(name, **kwargs)

    def start_requests(self):

        yield SplashRequest(
            url='https://www.someawesomesi.te',
            endpoint='execute',
            args={
                'lua_source': self.LUA_SOURCE,
                'crawlera_user': self.settings['ZYTE_SMARTPROXY_APIKEY'],
                'timeout': 90,
            },
            # tell Splash to cache the lua script, to avoid sending it for every request
            cache_args=['lua_source'],
            meta={
                'max_retry_times': 10,
            },
            callback=self.my_callback
        )

And the output I get is:

2022-08-10 13:09:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.someawesomesi.te via http://splash:8050/execute> (failed 1 times): 504 Gateway Time-out

Not sure how to proceed here. I did look into why it would be giving me a 504, and the Splash docs do suggest some ways of handling it... but I don't have many concurrent URLs, and the script fails on the very first one. Plus, the site I'm scraping is very fast, and if I just use Zyte without Splash, it scrapes very quickly.
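One way to narrow this down is to call Splash's execute endpoint directly with a trivial Lua script that does not touch the proxy at all. This is only a diagnostic sketch, not part of the linked project: it assumes port 8050 is published to the host as in the compose file above, and the names SPLASH_EXECUTE_URL, MINIMAL_LUA, build_execute_request and render are made up for illustration. If this call succeeds quickly while the spider still gets 504s, the time is probably being spent in the proxy Lua script rather than in Splash itself.

```python
import json
import urllib.request

# Assumption: Splash is reachable on the host via the published port 8050.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"

# Trivial script with no proxy hooks: load the page and return its HTML.
MINIMAL_LUA = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def build_execute_request(url, lua_source=MINIMAL_LUA, timeout=90):
    """Build a POST request for Splash's /execute endpoint."""
    body = {"url": url, "lua_source": lua_source, "timeout": timeout}
    return urllib.request.Request(
        SPLASH_EXECUTE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(url):
    """Send the request; a 504 from Splash surfaces as urllib.error.HTTPError."""
    # Give the HTTP client slightly longer than Splash's own timeout.
    with urllib.request.urlopen(build_execute_request(url), timeout=100) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(render("https://www.someawesomesi.te")[:500])
```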

So if anybody can suggest what's wrong here and how to fix it - I'd greatly appreciate it.

Solution

Splash is getting deprecated soon. You can use headless browser libraries for rendering JS along with Smart Proxy Manager. Zyte recently launched three headless browser libraries.

These client libraries are built on top of the native libraries for web automation across Chromium, Firefox, and WebKit, and are written to work seamlessly with Zyte Smart Proxy Manager. Using these libraries, you no longer have to keep a separate piece of software (like Splash) running in the background to connect to Zyte Smart Proxy Manager.

My recommendation would be to use Zyte API. Zyte API is an end-to-end API solution that executes all tasks in the web-scraping sequence. It can extract dynamically-loaded web page content without spending time recreating what the browser does through JavaScript, headless browser libraries and additional requests.

For this particular case, follow the Zyte API documentation: just set the javascript parameter, which turns JavaScript on or off during browser rendering. And it just works...
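As a rough illustration of what such a request might look like, here is a minimal sketch using only the Python standard library. It rests on my reading of the Zyte API reference, so treat the details as assumptions to check against the docs: the endpoint URL, the browserHtml and javascript request fields, and HTTP Basic auth with the API key as the username and an empty password. The helper names (build_payload, build_request, fetch_rendered_html) are made up for this example.

```python
import base64
import json
import urllib.request

# Assumption: extraction endpoint as given in the Zyte API reference.
ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_payload(url, javascript=True):
    """JSON body asking for browser-rendered HTML with JavaScript on or off."""
    return {"url": url, "browserHtml": True, "javascript": javascript}


def build_request(api_key, payload):
    """Build the POST request, using Basic auth with an empty password."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_rendered_html(api_key, url, timeout=120):
    """Send the request and return the rendered HTML from the JSON response."""
    request = build_request(api_key, build_payload(url))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["browserHtml"]


if __name__ == "__main__":
    html = fetch_rendered_html("<my key>", "https://www.someawesomesi.te")
    print(html[:500])
```

With this approach there is no Splash container to run, so the splash service and the scrapy_splash middlewares could be dropped from the setup in the question.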

I work as a Developer Advocate @Zyte.