
Scrapy (1.0) - Signals not received


What I'm trying to do is trigger a function (abc) when a Scrapy spider is opened, which should be fired by Scrapy's 'signals'.

(Later on I want to change it to 'closed' to save the stats from each spider to the database for daily monitoring.) For now I tried this simple solution, just to print something out, which I would expect to see in the console the moment the spider is opened while running the CrawlerProcess.

What happens is that the crawler runs fine, but it does not print the output of 'abc' at the moment the spider is opened, which should trigger it. At the end I posted what I see in the console, which just shows the spider running perfectly fine.

Why is the abc function not triggered by the signal at the point where I see 'INFO: Spider opened' in the log (or at all)?

MyCrawlerProcess.py:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

def abc():
    print('######################works!######################')

def from_crawler(crawler):
    crawler.signals.connect(abc, signal=signals.spider_opened)

process.crawl('Dissident')
process.start() # the script will block here until the crawling is finished

Console output:

2016-03-17 13:00:14 [scrapy] INFO: Scrapy 1.0.4 started (bot: Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource)
2016-03-17 13:00:14 [scrapy] INFO: Optional features available: ssl, http11
2016-03-17 13:00:14 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapytry.spiders', 'SPIDER_MODULES': ['scrapytry.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource'}
2016-03-17 13:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-17 13:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-17 13:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-17 13:00:14 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilesPipeline, ScrapytryPipeline
2016-03-17 13:00:14 [scrapy] INFO: Spider opened
2016-03-17 13:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-17 13:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-17 13:00:14 [scrapy] DEBUG: Crawled (200) <GET http://www.xyz.zzm/> (referer: None)

Solution

  • Simply defining from_crawler isn't enough, as it's never hooked into the Scrapy framework: in the script above it's just a module-level function that nothing calls. Scrapy only invokes from_crawler when it's a classmethod on a component (spider, extension, middleware, ...) that Scrapy itself instantiates. Take a look at the extensions docs, which show how to create an extension that does exactly what you're attempting to do. Be sure to follow the instructions for enabling the extension via the MYEXT_ENABLED setting.
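    The docs' approach can be sketched roughly as follows. This is a minimal sketch, assuming Scrapy 1.x; the module name myextension and the class name SpiderOpenedExtension are illustrative, not prescribed by Scrapy:

    ```python
    # myextension.py -- minimal extension sketch based on the Scrapy extensions docs.
    from scrapy import signals
    from scrapy.exceptions import NotConfigured


    class SpiderOpenedExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this classmethod itself when it builds the crawler,
            # so the signal connection below actually happens.
            if not crawler.settings.getbool('MYEXT_ENABLED'):
                raise NotConfigured
            ext = cls()
            # Connect the bound method to the spider_opened signal.
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            return ext

        def spider_opened(self, spider):
            print('######################works!######################')
    ```

    Then enable it in the project's settings.py with something like EXTENSIONS = {'scrapytry.myextension.SpiderOpenedExtension': 500} and MYEXT_ENABLED = True (the scrapytry package name comes from the log above; the exact module path is an assumption about your project layout).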