Tags: python, scrapy, twisted, lxml, pypy

Running Scrapy on PyPy


Is it possible to run Scrapy on PyPy? I've looked through the documentation and the GitHub project, but the only mention of PyPy is that some unit tests were being run on PyPy two years ago; see PyPy support. There is also a long Scrapy fails in PyPy discussion from three years ago, without a concrete resolution or follow-up.

From what I understand, Scrapy's main dependency, Twisted, is known to work on PyPy. Scrapy also uses lxml for HTML parsing, which has a PyPy-friendly fork. The other dependency, pyOpenSSL, is fully supported (thanks to @Glyph's comment).
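
A quick check like the following (just a sketch) at least confirms which interpreter is running and whether those three dependencies import and report their versions under it:

    import platform
    print(platform.python_implementation())  # "PyPy" when run under PyPy

    # The three dependencies mentioned above; an ImportError here means
    # the package is not installed for this interpreter.
    import twisted
    import lxml.etree
    import OpenSSL

    print(twisted.version)            # Twisted version object
    print(lxml.etree.LXML_VERSION)    # lxml version tuple
    print(OpenSSL.__version__)        # pyOpenSSL version string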


Solution

  • Yes. :-)

    In a bit more detail: I already had PyPy 2.6.0 (with pip) installed on my box. Running pip install scrapy very nearly just worked; it turned out I needed some extra libraries for lxml. After that it was fine.

    Once installed, I could run the dmoz tutorial. For example:

    [user@localhost scrapy_proj]# scrapy crawl dmoz
    2015-06-30 14:34:45 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapy_proj)
    2015-06-30 14:34:45 [scrapy] INFO: Optional features available: ssl, http11
    2015-06-30 14:34:45 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'scrapy_proj', 'NEWSPIDER_MODULE': 'scrapy_proj.spiders', 'SPIDER_MODULES': ['scrapy_proj.spiders']}
    2015-06-30 14:34:45 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
    
    2015-06-30 14:34:45 [scrapy] INFO: Enabled extensions: CoreStats, TelnetConsole, CloseSpider, LogStats, SpiderState
    2015-06-30 14:34:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-06-30 14:34:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-06-30 14:34:45 [scrapy] INFO: Enabled item pipelines: 
    2015-06-30 14:34:45 [scrapy] INFO: Spider opened
    2015-06-30 14:34:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-06-30 14:34:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-06-30 14:34:46 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2015-06-30 14:34:46 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2015-06-30 14:34:46 [scrapy] INFO: Closing spider (finished)
    2015-06-30 14:34:46 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 514,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16286,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 6, 30, 13, 34, 46, 219002),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'log_count/WARNING': 1,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2015, 6, 30, 13, 34, 45, 652421)}
    2015-06-30 14:34:46 [scrapy] INFO: Spider closed (finished)
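
    For reference, the spider behind that crawl is essentially the one from the dmoz tutorial; nothing in it is PyPy-specific. A rough sketch (the class name and xpath expressions follow the tutorial, and the two start URLs are the ones visible in the log above):

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # Yield one item (a plain dict) per directory entry; the
            # selectors are the tutorial's and may need adjusting to the
            # real page layout.
            for sel in response.xpath('//ul/li'):
                yield {
                    'title': sel.xpath('a/text()').extract(),
                    'link': sel.xpath('a/@href').extract(),
                }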
    

    And as requested, here's some more info on the version I'm running:

    [user@localhost scrapy_proj]# which scrapy
    /opt/pypy/bin/scrapy
    [user@localhost scrapy_proj]# scrapy version
    2015-06-30 15:04:42 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapy_proj)
    2015-06-30 15:04:42 [scrapy] INFO: Optional features available: ssl, http11
    2015-06-30 15:04:42 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'scrapy_proj', 'NEWSPIDER_MODULE': 'scrapy_proj.spiders', 'SPIDER_MODULES': ['scrapy_proj.spiders']}
    Scrapy 1.0.0