My system specs: Ubuntu 17.10, 4 GB RAM, 50 GB swap.
I would like to crawl all 24,453 records from https://www.sanego.de/Arzt/Allgemeine+Chirurgie/, but I cannot load the full page, seemingly due to its size.
Initially the webpage displays only the first 30 records. Clicking the button with title="Mehr anzeigen" ("show more") once loads another 30 records, and this can be repeated until all the records are loaded. So the content is generated dynamically with JavaScript.
My idea is to press that button as many times as needed for all 24,453 records to be displayed on the page. Once that is done, I can parse the page and collect all the records.
I tried two different spiders for this. First, I wrote a Scrapy spider that uses Selenium to render the dynamic content. However, this solution turned out to be too costly in terms of memory usage: the process eats all the RAM and crashes after around 1,500 records are loaded.
Second, I tried a Scrapy spider that renders the page through Splash (scrapy-splash). I assumed this solution might be faster and less memory-demanding than the previous one; however, the page loading exceeds Splash's max timeout limit of 3600 seconds and the spider crashes. I'll provide only this spider's code below, since I feel Splash may be the better fit for this case. Please ask if you'd like me to add the other one's code too.
I ran each of the spiders in a cgroup imposing a memory limit of 1 GB. The spiders stay within the memory limit, but crash anyway before the page is fully loaded.
Please provide me with any suggestions on how I can achieve this goal.
This is how I start Splash:
sudo cgexec -g memory:limitmem docker run -it --memory="1024m" \
    --memory-swappiness="100" -p 8050:8050 scrapinghub/splash --max-timeout 3600
This is how I run the spider:
sudo cgexec -g memory:limitmem scrapy crawl spersonel_spider
The main part of the spider:
from scrapy_splash import SplashRequest
import time
import json
import scrapy
from scrapy import Request
from sanego.items import PersonelItem


class SanegoSpider(scrapy.Spider):
    name = "spersonel_spider"
    start_urls = [
        'https://www.sanego.de/Arzt/Fachgebiete/',
        'https://www.sanego.de/Zahnarzt/Fachgebiete/',
        'https://www.sanego.de/Heilpraktiker/Fachgebiete/',
        'https://www.sanego.de/Tierarzt/Fachgebiete/',
    ]

    def parse(self, response):
        # Collect the per-speciality search URLs from the overview pages.
        search_urls = [
            "https://www.sanego.de" + url
            for url in response.xpath(
                '//ul[@class="itemList"]/li[contains(@class,"col-md-4")]/a/@href'
            ).extract()
        ]

        # Lua script for Splash: keep clicking the "Mehr anzeigen" button
        # until it disappears, then return the fully rendered HTML.
        script = """
        function main(splash)
            local url = splash.args.url
            splash.images_enabled = false
            assert(splash:go(url))
            assert(splash:wait(1))
            local element = splash:select('.loadMore')
            while element ~= nil do
                assert(element:mouse_click())
                assert(splash:wait{2, cancel_on_error=true})
                element = splash:select('.loadMore')
            end
            return {
                html = splash:html(),
                --png = splash:png(),
                --har = splash:har(),
            }
        end
        """

        for url in search_urls:
            if url == 'https://www.sanego.de/Arzt/Allgemeine+Chirurgie/':
                yield SplashRequest(
                    url,
                    self.parse_search_results,
                    args={'wait': 2, 'lua_source': script, 'timeout': 3600},
                    endpoint='execute',
                    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'},
                )
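For context, parse_search_results would look roughly like this; the selectors and item fields here are placeholders, since the real extraction logic is not the issue:

    def parse_search_results(self, response):
        # Hypothetical selectors -- adjust them to the markup of the
        # fully rendered listing page.
        for row in response.xpath('//ul[@class="itemList"]/li'):
            item = PersonelItem()
            item['name'] = row.xpath('.//a/text()').extract_first()
            item['url'] = row.xpath('.//a/@href').extract_first()
            yield item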
That page loads more data via AJAX, so you can simulate the AJAX calls with plain Scrapy, without Splash.
from scrapy import FormRequest

cookies = {
    'sanego_sessid': 'meomk0luq31rcjl5qp38tsftp1',
    'AWSELB': '0D1143B71ECAB811932E9F0030D39880BEAC9BABBC8CD3C44A99B4B781E433D347A4C2A6FDF836A5F4A4BE16334FBDA671EC87316CB08EB740C12A444F7E4A1EE15E3F26E2',
    '_ga': 'GA1.2.882998560.1521622515',
    '_gid': 'GA1.2.2063658924.1521622515',
    '_gat': '1',
}

headers = {
    'Origin': 'https://www.sanego.de',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'text/javascript, text/html, application/xml, text/xml, */*',
    'Referer': 'https://www.sanego.de/Arzt/Allgemeine+Chirurgie/',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'DNT': '1',
}

data = [
    ('doctorType', 'Arzt'),
    ('federalStateOrMedicalArea', 'Allgemeine Chirurgie'),
    ('p', '1'),
    ('sortBy', ''),
    ('sortOrder', ''),
]

# Yield this from your spider; Scrapy will call back with the HTML fragment.
yield FormRequest('https://www.sanego.de/ajax/load-more-doctors-for-search',
                  headers=headers, cookies=cookies, formdata=data)
Notice the ('p', '1') parameter: it is the page number. Keep incrementing it until you reach the final page.
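Putting it together, here is a minimal sketch of that pagination loop as a complete spider. The stop condition (an empty body once p passes the last page) and the row selector are assumptions you should verify against the real responses in your browser's network tab:

from scrapy import FormRequest, Spider


class SanegoAjaxSpider(Spider):
    name = "sanego_ajax"
    ajax_url = 'https://www.sanego.de/ajax/load-more-doctors-for-search'

    def start_requests(self):
        yield self.page_request(1)

    def page_request(self, page):
        # The endpoint expects the same form fields the "Mehr anzeigen"
        # button sends; p is the page number.
        return FormRequest(
            self.ajax_url,
            formdata={
                'doctorType': 'Arzt',
                'federalStateOrMedicalArea': 'Allgemeine Chirurgie',
                'p': str(page),
                'sortBy': '',
                'sortOrder': '',
            },
            headers={'X-Requested-With': 'XMLHttpRequest'},
            meta={'page': page},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Assumption: past the final page the endpoint returns an empty body.
        if not response.text.strip():
            return
        # The response is an HTML fragment holding the next 30 records;
        # the row selector here is hypothetical.
        for row in response.xpath('//li'):
            yield {'record': row.extract()}
        # Request the next page.
        yield self.page_request(response.meta['page'] + 1)

At 30 records per request, about 816 requests cover all 24,453 records, each response being a small HTML fragment instead of one enormous browser page holding everything at once.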