I'm using Scrapy to create a sample web crawler as a Nameko dependency provider, but it isn't crawling any pages. Below is the code:
import scrapy
from scrapy import crawler
from nameko import extensions
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    result = None

    def parse(self, response):
        TestSpider.result = {
            'heading': response.css('h1::text').extract_first()
        }


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self, spider=None):
        spider = TestSpider()
        spider.name = 'test_spider'
        spider.start_urls = ['http://www.example.com']
        self.runner = crawler.CrawlerRunner()
        self.runner.crawl(spider)
        d = self.runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return spider.result

    def run(self):
        if not reactor.running:
            reactor.run()
And here is the log:
Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Enabled item pipelines:
[]
Spider opened
Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Closing spider (finished)
Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 126088),
'log_count/INFO': 7,
'memusage/max': 59650048,
'memusage/startup': 59650048,
'start_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 97747)}
Spider closed (finished)
In the log we can see it didn't crawl a single page; I expected it to crawl one page. Whereas, if I create a regular CrawlerRunner and crawl the page, I get the expected result back: {'heading': 'Example Domain'}. Below is the code:
import scrapy
from scrapy import crawler
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}


def crawl():
    runner = crawler.CrawlerRunner()
    runner.crawl(TestSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    crawl()
I've been struggling with this issue for a couple of days and can't figure out why the Scrapy crawler is unable to crawl pages when used as a Nameko dependency provider. Please point out where I'm going wrong.
Tarun's comment is correct. Nameko uses Eventlet for concurrency, whereas Scrapy uses Twisted. These both work in a similar way: there is a main thread (the Reactor, in Twisted) that schedules all the other work, as an alternative to the normal Python threading scheduler. Unfortunately the two systems don't interoperate.
If you really want to integrate Nameko and Scrapy, your best bet is to use a separate process for Scrapy, as in the answers to these questions:
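For what it's worth, here is a rough, untested sketch of that separate-process approach; the crawl_in_subprocess() wrapper, the _run_crawl helper, and the result-passing via a multiprocessing.Queue are just illustrative assumptions, not the exact code from those answers. The dependency provider's crawl() method would then call crawl_in_subprocess() instead of touching the reactor itself.

import multiprocessing

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        # Stash the scraped data on the class so it can be read back
        # once the crawl has finished.
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}


def _run_crawl(queue):
    # Runs in the child process, so the Twisted reactor never has to
    # coexist with Eventlet in the Nameko service process.
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()  # blocks until the crawl is finished
    queue.put(TestSpider.result)


def crawl_in_subprocess():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_crawl, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result


if __name__ == '__main__':
    print(crawl_in_subprocess())  # e.g. {'heading': 'Example Domain'}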