Scrapy tutorial (noob) - 0 pages crawled

I've been trying to follow the Scrapy tutorial (as in, very very beginning) and after running the command at the project top level (i.e. the level with scrapy.cfg) I get the following output:

 mikey@ubuntu:~/scrapy/tutorial$ scrapy crawl dmoz
/usr/lib/pymodules/python2.7/scrapy/settings/ ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-26 04:17:06-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: tutorial)
2014-01-26 04:17:06-0800 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-01-26 04:17:06-0800 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'tutorial.items.TutorialItem', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled item pipelines: 
2014-01-26 04:17:06-0800 [dmoz] INFO: Spider opened
2014-01-26 04:17:06-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Telnet console listening on
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Web service listening on
2014-01-26 04:17:06-0800 [dmoz] DEBUG: Crawled (200) <GET> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] DEBUG: Crawled (200) <GET> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] INFO: Closing spider (finished)
2014-01-26 04:17:07-0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 472,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 14888,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 1, 26, 12, 17, 7, 63261),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 1, 26, 12, 17, 6, 567929)}
2014-01-26 04:17:07-0800 [dmoz] INFO: Spider closed (finished)

(I.e. 0 pages crawled, at 0 pages/min!)

Troubleshooting so far:

1) Checked the syntax of both files (both copied and pasted AND hand-typed)
2) Searched for the problem online but cannot see others with a similar issue
3) Checked the folder structure etc., making sure I am running the command from the correct place
4) Upgraded to the latest version of Scrapy

Any suggestions? My code is exactly as in the examples:

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = [""]
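    start_urls = [
        # (the two start URLs are missing here)
    ]

    def parse(self, response):
        # Save each downloaded page to a file named after the second-to-last URL segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)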


from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


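  • First, decide what you want to crawl.

    You passed two start URLs to Scrapy and it did crawl both of them (see the two "Crawled (200)" lines and 'response_received_count': 2 in your log). The "Crawled 0 pages" line is only the status report printed when the spider opens. After those two pages, Scrapy had no more URLs to follow.

    None of the book links on that page match allowed_domains.

    You can do yield Request([next url]) to crawl more links; the next URL can be parsed from the response.

    Or subclass CrawlSpider and specify rules, like this example.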