selenium · web-scraping · twisted · scrapy

Defer parts of a scrape in Scrapy


I have the parse method given below. It uses Selenium to first load a page and visit certain pages that can't be reached by scraping directly from the spider, then collects the individual URLs and hands them to another parse method, which extracts items from those pages. The problem is that this parse method blocks all other parsing until every page has been visited, which chokes the system. I tried adding a sleep, but that stops the whole engine, not just this parse method.

Any pointers on how I could optimize this, or at least make the sleep work so that it doesn't stop the engine?

# Imports assumed by this snippet (the method is shown outside its spider class):
import time

from pyvirtualdisplay import Display
from selenium import webdriver

from scrapy import log
from scrapy.conf import settings
from scrapy.http import Request

def parse(self, response):
    '''Parse first page and extract page links'''

    item_link_xpath = "/html/body/form/div[@class='wrapper']//a[@title='View & Apply']"
    pagination_xpath = "//div[@class='pagination']/input"
    page_xpath = pagination_xpath + "[@value=%d]"

    display = Display(visible=0, size=(800, 600))
    display.start()

    browser = webdriver.Firefox()
    browser.get(response.url)
    log.msg('Loaded search results', level=log.DEBUG)

    page_no = 1
    while True:
        log.msg('Scraping page: %d' % page_no, level=log.DEBUG)
        for link in [item_link.get_attribute('href') for item_link in browser.find_elements_by_xpath(item_link_xpath)]:
            yield Request(link, callback=self.parse_item_page)
        page_no += 1
        log.msg('Using xpath: %s' % (page_xpath % page_no), level=log.DEBUG)
        # find_element_by_xpath raises NoSuchElementException when there is
        # no match, so use the plural form, which returns a (possibly empty) list.
        page_elements = browser.find_elements_by_xpath(page_xpath % page_no)
        if not page_elements or page_no > settings['PAGINATION_PAGES']:
            break
        page_elements[0].click()
        if settings['PAGINATION_SLEEP_INTERVAL']:
            seconds = int(settings['PAGINATION_SLEEP_INTERVAL'])
            log.msg('Sleeping for %d' % seconds, level=log.DEBUG)
            time.sleep(seconds)
    log.msg('Scraped listing pages, closing browser.', level=log.DEBUG)
    browser.quit()  # quit() ends the WebDriver session; close() only closes the current window
    display.stop()

Solution

  • This may help:

    # delayspider.py
    from scrapy.spider import BaseSpider
    from twisted.internet import reactor, defer
    from scrapy.http import Request
    
    DELAY = 5 # seconds
    
    
    class MySpider(BaseSpider):
    
        name = 'wikipedia'
        max_concurrent_requests = 1
    
        start_urls = ['http://www.wikipedia.org']
    
        def parse(self, response):
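            # Return a Deferred instead of a Request; Scrapy will wait on
            # it without blocking the reactor, and callLater fires it with
            # the next Request after DELAY seconds.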
            nextreq = Request('http://en.wikipedia.org')
            dfd = defer.Deferred()
            reactor.callLater(DELAY, dfd.callback, nextreq)
            return dfd
    

    Output:

    $ scrapy runspider delayspider.py 
    2012-05-24 11:01:54-0300 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled item pipelines: 
    2012-05-24 11:01:54-0300 [wikipedia] INFO: Spider opened
    2012-05-24 11:01:54-0300 [wikipedia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-05-24 11:01:56-0300 [wikipedia] DEBUG: Crawled (200) <GET http://www.wikipedia.org> (referer: None)
    2012-05-24 11:02:04-0300 [wikipedia] DEBUG: Redirecting (301) to <GET http://en.wikipedia.org/wiki/Main_Page> from <GET http://en.wikipedia.org>
    2012-05-24 11:02:06-0300 [wikipedia] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: http://www.wikipedia.org)
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Closing spider (finished)
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Dumping spider stats:
        {'downloader/request_bytes': 745,
         'downloader/request_count': 3,
         'downloader/request_method_count/GET': 3,
         'downloader/response_bytes': 29304,
         'downloader/response_count': 3,
         'downloader/response_status_count/200': 2,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 5, 24, 14, 2, 11, 447498),
         'request_depth_max': 2,
         'scheduler/memory_enqueued': 3,
         'start_time': datetime.datetime(2012, 5, 24, 14, 1, 54, 408882)}
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Spider closed (finished)
    2012-05-24 11:02:11-0300 [scrapy] INFO: Dumping global stats:
        {}
    

    It uses Twisted's callLater to sleep. Because parse returns a Deferred rather than blocking, the reactor keeps running: Scrapy simply waits for the Deferred, and callLater fires it with the next Request after DELAY seconds, so only this callback is delayed while other requests proceed normally.
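
    The original question stalls because the whole Selenium session runs
    synchronously inside parse. One way to apply the same non-blocking idea
    there (a sketch of mine, not part of the original answer; collect_links
    and make_requests are hypothetical helpers) is to push the blocking
    browser work onto Twisted's thread pool with deferToThread; time.sleep
    is then harmless because it only blocks the worker thread:

    from twisted.internet import threads
    from scrapy.http import Request

    def parse(self, response):
        # Run the blocking Selenium session in Twisted's thread pool so
        # the reactor (and other downloads) keep running in the meantime.
        dfd = threads.deferToThread(self.collect_links, response.url)
        dfd.addCallback(self.make_requests)
        return dfd

    def collect_links(self, url):
        # The blocking code from the question goes here (Display,
        # webdriver, pagination loop). time.sleep() is fine in this
        # thread since it no longer blocks the reactor.
        links = []
        # ... visit pages, append item URLs to links ...
        return links

    def make_requests(self, links):
        # Scrapy treats the Deferred's result like a normal callback
        # return value, so a list of Requests is scheduled as usual.
        return [Request(link, callback=self.parse_item_page) for link in links]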