python, selenium, web-scraping, scrapy, scrapy-splash

Load and crawl a huge webpage with Scrapy-Splash


My system specs: Ubuntu 17.10, 4 GB RAM, 50 GB swap

My goal in short

I would like to crawl all 24,453 records from https://www.sanego.de/Arzt/Allgemeine+Chirurgie/.

The Problem

I cannot load the page, seemingly due to its size.

More details about the content of the page

Initially the webpage displays only the first 30 records. Clicking the button with title="Mehr anzeigen" ("show more") once loads another 30 records, and this can be repeated until all the records are shown. So the content is generated dynamically with JavaScript.

My general strategy

My idea is to click the 'title="Mehr anzeigen"' button as many times as needed for all 24,453 records to be displayed on the page. Once that is done, I can parse the page and collect all the records.

Scrapy + Selenium approach

I tried two different spiders for this. First, I wrote a Scrapy spider that uses Selenium to render the dynamic content. However, this solution turned out to be too costly in terms of memory usage: the process eats all the RAM and crashes after around 1,500 records are loaded.
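
Roughly, the Selenium part of that spider looked like the following sketch (not my exact code; the '.loadMore' selector is borrowed from the Lua script further down and the fixed waits are illustrative):

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # Firefox behaves the same way here
driver.get('https://www.sanego.de/Arzt/Allgemeine+Chirurgie/')

while True:
    try:
        # the 'Mehr anzeigen' ("show more") button that loads 30 more records per click
        button = driver.find_element_by_css_selector('.loadMore')
    except NoSuchElementException:
        break  # no button left, so all records are loaded
    button.click()
    time.sleep(2)  # give the AJAX call time to finish

html = driver.page_source  # the fully rendered page, ready for parsing
driver.quit()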

Scrapy + Splash approach

I assumed this solution might be faster and less memory-demanding than the previous one. However, loading the page exceeds Splash's maximum timeout limit of 3600 seconds and the spider crashes. I'll provide only this spider's code below, since I feel Splash may be the better solution for this case. Please ask if you'd like me to add the other one too.

Limit memory usage

I ran each of the spiders in a cgroup imposing a memory limit of 1 GB. The spiders stay within the memory limit, but crash anyway before the page is fully loaded.

Question

Please provide me with any suggestions on how I can achieve this goal.

Code

That's how I start Splash:

sudo cgexec -g memory:limitmem docker run -it --memory="1024m" \
    --memory-swappiness="100" -p 8050:8050 scrapinghub/splash --max-timeout 3600

That's how I run the spider:

sudo cgexec -g memory:limitmem scrapy crawl spersonel_spider

Main part of the spider:

from scrapy_splash import SplashRequest
import time
import json     
import scrapy
from scrapy import Request
from sanego.items import PersonelItem

class SanegoSpider(scrapy.Spider):

    name = "spersonel_spider"

    start_urls = ['https://www.sanego.de/Arzt/Fachgebiete/','https://www.sanego.de/Zahnarzt/Fachgebiete/', 'https://www.sanego.de/Heilpraktiker/Fachgebiete/', 'https://www.sanego.de/Tierarzt/Fachgebiete/',]  

    def parse(self, response):

        search_urls = ["https://www.sanego.de" + url for url in response.xpath('//ul[@class="itemList"]/li[contains(@class,"col-md-4")]/a/@href').extract()]

        script = """
        function main(splash)

            local url = splash.args.url
            splash.images_enabled = false

            assert(splash:go(url))
            assert(splash:wait(1))

            local element = splash:select('.loadMore')
            while element ~= nil do
                assert(element:mouse_click())
                assert(splash:wait{2,cancel_on_error=true})
                element = splash:select('.loadMore')
            end
            return {
                html = splash:html(),
                --png = splash:png(),
                --har = splash:har(),
            }
        end
        """

        for url in search_urls:
            if url == 'https://www.sanego.de/Arzt/Allgemeine+Chirurgie/':
                yield SplashRequest(url, self.parse_search_results, args={'wait': 2, 'lua_source': script, 'timeout':3600},endpoint='execute', headers={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'})

Solution

  • That page loads more data via AJAX, so simulate the AJAX request directly with plain Scrapy, without using Splash.

    import requests
    
    cookies = {
        'sanego_sessid': 'meomk0luq31rcjl5qp38tsftp1',
        'AWSELB': '0D1143B71ECAB811932E9F0030D39880BEAC9BABBC8CD3C44A99B4B781E433D347A4C2A6FDF836A5F4A4BE16334FBDA671EC87316CB08EB740C12A444F7E4A1EE15E3F26E2',
        '_ga': 'GA1.2.882998560.1521622515',
        '_gid': 'GA1.2.2063658924.1521622515',
        '_gat': '1',
    }
    
    headers = {
        'Origin': 'https://www.sanego.de',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'text/javascript, text/html, application/xml, text/xml, */*',
        'Referer': 'https://www.sanego.de/Arzt/Allgemeine+Chirurgie/',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'DNT': '1',
    }
    
    data = [
      ('doctorType', 'Arzt'),
      ('federalStateOrMedicalArea', 'Allgemeine Chirurgie'),
      ('p', '1'),
      ('sortBy', ''),
      ('sortOrder', ''),
    ]
    
    response = requests.post('https://www.sanego.de/ajax/load-more-doctors-for-search', headers=headers, cookies=cookies, data=data)
    

    Notice the ('p', '1') argument, and keep incrementing it until you reach the final page.
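
    A minimal sketch of that loop, reusing the cookies and headers dicts from above (the stop condition is an assumption: I'm assuming the endpoint returns an empty body once there are no more records; inside a Scrapy spider you would yield FormRequest objects to the same endpoint instead of calling requests.post):

    import requests

    url = 'https://www.sanego.de/ajax/load-more-doctors-for-search'
    pages = []

    page = 1
    while True:
        data = [
            ('doctorType', 'Arzt'),
            ('federalStateOrMedicalArea', 'Allgemeine Chirurgie'),
            ('p', str(page)),  # the page counter to keep incrementing
            ('sortBy', ''),
            ('sortOrder', ''),
        ]
        response = requests.post(url, headers=headers, cookies=cookies, data=data)
        if not response.text.strip():  # assumption: empty body means no more pages
            break
        pages.append(response.text)  # each response is an HTML fragment with ~30 records
        page += 1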