Tags: web-crawler, scrapy, scrapy-splash

Run Splash From File


I've been researching this for a few days and have found a lot of answers that are kind of like my question, but not quite, so I decided to go ahead and post it. I'm using scrapy-splash to crawl KBB. I was able to get around the stupid first-time-use popup by using send_text and send_keys, and this works super well in the browser version of Splash. It pulls in the dynamic content just like I want, AWESOME!


Here's the code for easy copy-paste-ability:

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  -- type the ZIP code into the popup's focused input and submit it
  splash:send_text("24153")
  splash:send_keys("<Return>")
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
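
For what it's worth, the same script can also be exercised outside the Splash UI by POSTing it straight to Splash's /execute HTTP endpoint. A minimal sketch, assuming a Splash instance on localhost:8050 and trimming the return value down to just the HTML:

# Sketch: run the Lua script above against Splash's HTTP API directly,
# outside of Scrapy (assumes Splash is listening on localhost:8050).
import requests

lua = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  splash:send_text("24153")
  splash:send_keys("<Return>")
  assert(splash:wait(5))
  return {html = splash:html()}
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": lua, "url": "https://www.kbb.com/ford/escape/2017/titanium/"},
    timeout=90,
)
resp.raise_for_status()
html = resp.json()["html"]  # the rendered page, popup already dismissed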

Now I'm trying to make it work in-script, because I want to be able to render multiple HTML files all at once. This is the code I have so far; I just have two URLs in there to test with for now:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://www.kbb.com/ford/escape/2017/titanium/", "https://www.kbb.com/honda/cr-v/2017/touring/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'send_text':24153, 'send_keys':'<Return>', 'wait': 5.0},
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'car-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

When I try to run this, it just keeps timing out:

2018-01-16 19:34:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:35:02 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://192.168.65.0:8050/robots.txt> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:35:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://192.168.65.0:8050/robots.txt> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://192.168.65.0:8050/robots.txt>: TCP connection timed out: 60: Operation timed out.
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.

This is the custom stuff at the bottom of my settings.py; I'm not sure you need the whole file since most of it is commented out:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050/'
SPLASH_URL = 'http://192.168.65.0:8050' 

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
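
As an aside, a quick way to sanity-check that whatever SPLASH_URL points at is actually reachable is to hit the instance's health endpoint. A minimal sketch, assuming a standard Splash instance (which exposes /_ping):

# Sketch: verify the Splash instance behind SPLASH_URL is reachable
# (assumes a standard Splash container exposing the /_ping endpoint).
import requests

SPLASH_URL = "http://192.168.65.0:8050"  # whatever settings.py points at

try:
    resp = requests.get(SPLASH_URL.rstrip("/") + "/_ping", timeout=5)
    print(resp.status_code, resp.text)  # a healthy instance reports status "ok"
except requests.exceptions.RequestException as exc:
    print("Splash is unreachable:", exc)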

I've been following multiple tutorials trying to get this to work. I'm assuming it has something to do with that SPIDER_MIDDLEWARES setting, but I don't know what needs to change there. I'm VERY new to spiders, so any help would be much appreciated.


Solution

  • This took almost two weeks, but I FINALLY got what I wanted. I had to switch to AutoBlog because KBB didn't have everything I needed. The problem with AutoBlog is that the bottom of the page only loads when you actually scroll to it, so I used mouse_click on a navigation button to scroll down to the part of the page I needed, then waited a few seconds before rendering (a scroll-based alternative is sketched after the code).

    import scrapy
    from scrapy_splash import SplashRequest
    
    class MySpider(scrapy.Spider):
        name = "cars"
        start_urls = ["https://www.autoblog.com/buy/2017-Ford-Escape-SE__4dr_4x4/", "https://www.autoblog.com/buy/2017-Honda-CR_V-EX_L__4dr_Front_wheel_Drive/"]
    
        script = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(10.0))
          -- click a navigation button so the lazily loaded bottom of the page renders
          splash:mouse_click(800, 335)
          assert(splash:wait(10.0))
          return {
            html = splash:html()
          }
        end
        """
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse,
                    endpoint='execute',
                    args={'lua_source': self.script, 'wait': 1.0},
                )
    
        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'car-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
    

    Still some touch-ups to do, and I need to add more URLs, but it's a functioning block of code!
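
    A possible refinement, sketched here as an untested alternative: instead of clicking hard-coded coordinates with mouse_click, splash:runjs can scroll the page directly, which is less brittle if the layout ever shifts:

        script = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(10.0))
          -- scroll to the bottom so the lazily loaded sections render,
          -- instead of clicking a button at fixed coordinates
          splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
          assert(splash:wait(10.0))
          return {
            html = splash:html()
          }
        end
        """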