ubuntu scrapy scrapyd scrapinghub

How can I run Scrapyd on a server?


Scrapinghub recently dropped periodic jobs from its free package, which is what I used to run my Scrapy crawlers.

Therefore, I decided to use Scrapyd instead, and I got a virtual server running Ubuntu 16.04. (This is my first time setting up and running a server, so please bear with me.)

Following the instructions on scrapyd.readthedocs.io I installed Scrapyd using pip:

$ pip install scrapyd

(That was after I figured out that the recommended way for Ubuntu, using apt-get, is actually no longer supported; see GitHub.)

Then I logged on to my server using SSH and started Scrapyd by simply running:

$ scrapyd

Everything looks fine as far as I can tell:

2017-10-30 17:31:19+0000 [-] Log opened.
2017-10-30 17:31:19+0000 [-] twistd 16.0.0 (/usr/bin/python 2.7.12) starting up.
2017-10-30 17:31:19+0000 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-10-30 17:31:19+0000 [-] Site starting on 6800
2017-10-30 17:31:19+0000 [-] Starting factory <twisted.web.server.Site instance at 0x7f644752bfc8>
2017-10-30 17:31:19+0000 [Launcher] Scrapyd 1.2.0 started: max_proc=4, runner=u'scrapyd.runner'

I would expect to see a web interface (described here) when I go to my IP at http://82.165.102.18:6800.

Instead, I just get the error message "This site can’t be reached 82.165.102.18 refused to connect."

When I try to run Scrapyd locally, everything works just fine, and I get the web interface at http://localhost:6800/.

I have tried disabling the firewall (UFW), but that didn't help.

At this point, I am lost. If you have any ideas, please let me know!

Thanks a lot!


Solution

  • If you can reach your Scrapyd instance locally but not over the network, Scrapyd is most likely listening only on localhost. Make sure the [scrapyd] section of your scrapyd.conf contains this line:

    [scrapyd]
    bind_address = 0.0.0.0

    This instructs Scrapyd to listen on all interfaces. bind_address defaults to 127.0.0.1, so by default it only accepts connections from localhost. Restart Scrapyd after changing the file so the new setting takes effect.
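
To tell a bind-address problem apart from a firewall problem, it can help to test the port directly. The following is a minimal sketch, assuming Python 3 is available; the helper name can_connect is made up for illustration and is not part of Scrapyd.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False
```

Calling can_connect("127.0.0.1", 6800) on the server itself and can_connect("82.165.102.18", 6800) from your own machine should show the difference: if the first succeeds and the second is refused, Scrapyd is probably bound to localhost only. If both succeed, the bind address is fine and the problem lies elsewhere.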