python, deployment, web-scraping, scrapy, scrapyd

Scrapyd: deploy a project on a server with a dynamic IP


I want to deploy my Scrapy project to a server whose IP is not listed in the scrapy.cfg file, because the IP can change, and I want to automate the deployment process. I tried giving the server's IP directly in the deploy command, but it did not work. Any suggestions on how to do this?


Solution

  • First, you should consider assigning a domain name to the server, so you can always reach it regardless of its dynamic IP. DynDNS comes in handy for this.

    Second, you probably won't do the first, because you don't have access to the server, or for some other reason. In that case, I suggest mimicking the above behavior by using your system's hosts file. As the Wikipedia article describes it:

    The hosts file is a computer file used by an operating system to map hostnames to IP addresses.
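    A typical entry puts the IP address first, followed by the hostname(s) it should resolve to; for instance (the IP below is just a placeholder from the documentation range):

    203.0.113.10    remotemachine

    Keep in mind that editing the hosts file typically requires administrator or root privileges.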

    For example, let's say you set your url to remotemachine in your scrapy.cfg (a minimal sketch of such a deploy target follows below). You can then write a script that updates the hosts file with the latest IP address, and run it before deploying your spider. This approach has the benefit of a system-wide effect: if you are deploying multiple spiders, or using the same server for some other purpose, you don't have to update multiple configuration files.
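    For reference, a scrapy.cfg deploy target pointing at the hostname rather than a raw IP might look like this (the target name remote and the project name myproject are just placeholders):

    [deploy:remote]
    url = http://remotemachine:6800/
    project = myproject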

    This script could look something like this:

    import fileinput
    import sys

    def update_hosts(hostname, ip):
        # Pick the hosts file location based on the platform
        if sys.platform.startswith('linux'):
            hosts_path = '/etc/hosts'
        else:
            hosts_path = r'c:\windows\system32\drivers\etc\hosts'

        # With inplace=True, fileinput redirects stdout into the file,
        # so every print() below rewrites one line of the hosts file.
        for line in fileinput.input(hosts_path, inplace=True):
            if hostname in line:
                # hosts entries are "IP<tab>hostname"
                print("{0}\t{1}".format(ip, hostname))
            else:
                print(line.rstrip())

    if __name__ == '__main__':
        hostname = sys.argv[1]
        ip = sys.argv[2]
        update_hosts(hostname, ip)
        print("Done!")


    Of course, you should add argument checks, etc.; this is just a quick example.
    You can then run it prior to deploying, like this:

    python updatehosts.py remotemachine <remote_ip_here>
    

    If you want to take it a step further and add this functionality as an argument to scrapyd-deploy, you can go ahead and edit your scrapyd-deploy file (it's just a Python script) to accept the additional parameter and update the hosts file from within. But I'm not sure that is the best thing to do; keeping this implementation separate and more explicit is probably the better choice.
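
    If you do want a single command without patching scrapyd-deploy itself, a small wrapper is a middle ground. This is only a sketch, assuming the script above is saved as updatehosts.py next to it and that your deploy target is named remote in scrapy.cfg:

    import subprocess
    import sys

    from updatehosts import update_hosts  # the hosts-updating function shown above

    if __name__ == '__main__':
        # usage: python deploy.py <hostname> <ip> <target>
        hostname, ip, target = sys.argv[1], sys.argv[2], sys.argv[3]
        update_hosts(hostname, ip)                          # point the hostname at the new IP
        subprocess.check_call(['scrapyd-deploy', target])   # then deploy as usual

    This keeps the hosts-file logic in one place and leaves scrapyd-deploy untouched.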