Tags: python, scrapy, twisted, nameko

Scrapy Nameko DependencyProvider not crawling the page


I'm using Scrapy to create a sample web crawler as a Nameko dependency provider, but it isn't crawling any pages. Below is the code:

import scrapy
from scrapy import crawler
from nameko import extensions
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    result = None

    def parse(self, response):
        TestSpider.result = {
            'heading': response.css('h1::text').extract_first()
        }


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self, spider=None):
        spider = TestSpider()
        spider.name = 'test_spider'
        spider.start_urls = ['http://www.example.com']
        self.runner = crawler.CrawlerRunner()
        self.runner.crawl(spider)
        d = self.runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return spider.result

    def run(self):
        if not reactor.running:
            reactor.run()

and here is the log:

Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Enabled item pipelines:
[]
Spider opened

Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Closing spider (finished)
Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 126088),
 'log_count/INFO': 7,
 'memusage/max': 59650048,
 'memusage/startup': 59650048,
 'start_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 97747)}
Spider closed (finished)

As the log shows, it didn't crawl a single page, even though I expected it to crawl one.

By contrast, if I create a regular CrawlerRunner outside Nameko and crawl the page, I get the expected result back: {'heading': 'Example Domain'}. Below is the code:

import scrapy
from scrapy import crawler
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}

def crawl():
    runner = crawler.CrawlerRunner()
    runner.crawl(TestSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

if __name__ == '__main__':
    crawl()

I've been struggling with this issue for a couple of days and can't figure out why the Scrapy crawler is unable to crawl pages when used as a Nameko dependency provider. Please point out where I'm going wrong.


Solution

  • Tarun's comment is correct. Nameko uses Eventlet for concurrency, whereas Scrapy uses Twisted. Both work in a similar way: there is a main loop (the reactor, in Twisted) that schedules all the other work, as an alternative to the normal Python threading scheduler. Unfortunately, the two systems don't interoperate.

    If you really want to integrate Nameko and Scrapy, your best bet is to use a separate process for Scrapy. A rough sketch of that approach is shown below, and it is covered in more detail in the answers to these questions:
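
    For illustration, here is a minimal sketch of the separate-process idea: run the crawl in a child process and hand the result back over a queue. The `_run_spider` helper and the queue-based plumbing are my own illustrative choices, not anything provided by Nameko or Scrapy, so adapt them to your service.

    import multiprocessing
    from queue import Empty

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from nameko import extensions


    class TestSpider(scrapy.Spider):
        name = 'test_spider'
        start_urls = ['http://www.example.com']

        def __init__(self, result_queue, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.result_queue = result_queue

        def parse(self, response):
            # Send the scraped data back to the parent process.
            self.result_queue.put(
                {'heading': response.css('h1::text').extract_first()}
            )


    def _run_spider(result_queue):
        # Runs inside the child process, so Twisted's reactor never has to
        # coexist with Nameko's Eventlet event loop.
        process = CrawlerProcess()
        process.crawl(TestSpider, result_queue=result_queue)
        process.start()  # blocks until the crawl finishes


    class ScrapyDependency(extensions.DependencyProvider):

        def get_dependency(self, worker_ctx):
            return self

        def crawl(self):
            result_queue = multiprocessing.Queue()
            proc = multiprocessing.Process(target=_run_spider,
                                           args=(result_queue,))
            proc.start()
            proc.join()  # blocks this worker until the crawl completes
            try:
                return result_queue.get_nowait()
            except Empty:
                return None

    Keep in mind that proc.join() is a blocking call, so a long crawl will tie up the Nameko worker for its duration.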