I have the parse method shown below. It uses Selenium to load a page first, visits certain pages that cannot be reached by scraping directly from the spider, and passes the individual URLs it collects to another parse method, which extracts items from those pages. The problem is that this parse method blocks all other parsing until every page has been visited, which chokes the system. I tried adding a sleep, but that stops the engine altogether, not just this parse method.
Any pointers on how I could optimize this, or at least make the sleep work so that it doesn't stop the engine?
def parse(self, response):
    '''Parse first page and extract page links'''

    item_link_xpath = "/html/body/form/div[@class='wrapper']//a[@title='View & Apply']"
    pagination_xpath = "//div[@class='pagination']/input"
    page_xpath = pagination_xpath + "[@value=%d]"

    display = Display(visible=0, size=(800, 600))
    display.start()

    browser = webdriver.Firefox()
    browser.get(response.url)
    log.msg('Loaded search results', level=log.DEBUG)

    page_no = 1
    while True:
        log.msg('Scraping page: %d'%page_no, level=log.DEBUG)
        for link in [item_link.get_attribute('href') for item_link in browser.find_elements_by_xpath(item_link_xpath)]:
            yield Request(link, callback=self.parse_item_page)
        page_no += 1
        log.msg('Using xpath: %s'%(page_xpath%page_no), level=log.DEBUG)
        page_element = browser.find_element_by_xpath(page_xpath%page_no)
        if not page_element or page_no > settings['PAGINATION_PAGES']:
            break
        page_element.click()
        if settings['PAGINATION_SLEEP_INTERVAL']:
            seconds = int(settings['PAGINATION_SLEEP_INTERVAL'])
            log.msg('Sleeping for %d'%seconds, level=log.DEBUG)
            time.sleep(seconds)
    log.msg('Scraped listing pages, closing browser.', level=log.DEBUG)
    browser.close()
    display.stop()
This may help:
# delayspider.py
from scrapy.spider import BaseSpider
from twisted.internet import reactor, defer
from scrapy.http import Request

DELAY = 5 # seconds


class MySpider(BaseSpider):
    name = 'wikipedia'
    max_concurrent_requests = 1

    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        nextreq = Request('http://en.wikipedia.org')
        dfd = defer.Deferred()
        reactor.callLater(DELAY, dfd.callback, nextreq)
        return dfd
Output:
$ scrapy runspider delayspider.py
2012-05-24 11:01:54-0300 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled item pipelines:
2012-05-24 11:01:54-0300 [wikipedia] INFO: Spider opened
2012-05-24 11:01:54-0300 [wikipedia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-05-24 11:01:56-0300 [wikipedia] DEBUG: Crawled (200) <GET http://www.wikipedia.org> (referer: None)
2012-05-24 11:02:04-0300 [wikipedia] DEBUG: Redirecting (301) to <GET http://en.wikipedia.org/wiki/Main_Page> from <GET http://en.wikipedia.org>
2012-05-24 11:02:06-0300 [wikipedia] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: http://www.wikipedia.org)
2012-05-24 11:02:11-0300 [wikipedia] INFO: Closing spider (finished)
2012-05-24 11:02:11-0300 [wikipedia] INFO: Dumping spider stats:
{'downloader/request_bytes': 745,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 29304,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 5, 24, 14, 2, 11, 447498),
'request_depth_max': 2,
'scheduler/memory_enqueued': 3,
'start_time': datetime.datetime(2012, 5, 24, 14, 1, 54, 408882)}
2012-05-24 11:02:11-0300 [wikipedia] INFO: Spider closed (finished)
2012-05-24 11:02:11-0300 [scrapy] INFO: Dumping global stats:
{}
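It uses Twisted's callLater to sleep: parse returns a Deferred that fires with the next request after DELAY seconds, so the wait is scheduled on the reactor instead of blocking it the way time.sleep does.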
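To see why the scheduled delay behaves differently from time.sleep, here is a small sketch of the same idea. Note that it is an analogy only: it uses the standard library's asyncio event loop as a stand-in for the Twisted reactor, so it runs without Scrapy installed. The names ticker, blocking_delay, cooperative_delay, largest_gap and compare are made up for this illustration and are not part of Scrapy or Twisted.

```python
import asyncio
import time


async def ticker(ticks, interval=0.05, count=4):
    """Stand-in for the other work the event loop should keep doing."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_delay(delay):
    # time.sleep never hands control back to the loop,
    # so every other task stalls until it returns.
    time.sleep(delay)


async def cooperative_delay(delay):
    # asyncio.sleep schedules a wake-up and yields to the loop,
    # which is roughly what reactor.callLater does in Twisted.
    await asyncio.sleep(delay)


async def largest_gap(delay_func, delay):
    """Run the ticker next to delay_func; return the longest pause between ticks."""
    ticks = []
    await asyncio.gather(ticker(ticks), delay_func(delay))
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


def compare(delay=0.3):
    """Return (longest pause with blocking sleep, longest pause with scheduled sleep)."""
    blocked = asyncio.run(largest_gap(blocking_delay, delay))
    cooperative = asyncio.run(largest_gap(cooperative_delay, delay))
    return blocked, cooperative


if __name__ == "__main__":
    blocked, cooperative = compare()
    print("longest pause with time.sleep:      %.2fs" % blocked)
    print("longest pause with scheduled sleep: %.2fs" % cooperative)
```

With the blocking version the ticker is held up for about the whole delay, while with the scheduled version it keeps ticking at its own interval. A time.sleep inside a Scrapy callback is in the position of blocking_delay here, which would explain why the whole engine stops and not just that one parse method.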