How do I get pathos to spawn processes on my remote server?


I have two computers, both with the pathos Python module installed. I have a pathos multiprocessing pool and have been trying to get pathos to split the processes evenly between the two machines using the following code:

import time

from pathos.multiprocessing import ProcessPool

# n_gram and questions are defined elsewhere in my program
ngramPool = ProcessPool()
ngramPool.ncpus = 8
ngramPool.servers = ('localhost:5653', 'ec2-18-223-23-82.us-east-2.compute.amazonaws.com:5653')
questionNgrams = []
i = 0
previousI = 0
previousTime = time.time()
# Test questions
# questions = ["To whom do I owe this great pleasure", "Who do I owe this great pleasure which is a great pleasure to", "Who do I owe this great pleasure to"]
questionNgrams = ngramPool.map(n_gram.stringToNgrams, questions)

However, instead of running 4 processes on my local CPU and 4 on the Amazon EC2 instance, all 8 processes are being run on my local processor. How do I set up pathos so that it spawns 4 processes on my CPU and another 4 on the Amazon instance?


Solution

  • I'm the pathos author. Working with distributed resources isn't as straightforward as you might want. You are correct (in your comments) that pathos uses RPC-based connections wrapped in SSH. You are also correct that you have to set up a ppserver on the remote host. If you need to make an SSH connection, you can do that with the pathos_connect script (see the associated documentation) or directly in code with the pathos.secure module. Note that you'll also need a working ssh-agent and SSH key-pair authentication set up (i.e. no passphrase is requested after the initial connection).

    Having said that, it's pretty difficult to get exactly 4 remote workers and 4 local workers, as the ParallelPool is dynamically load balanced. If you have "quick" tasks to run, the vast majority of them, if not all, will run locally, because spinning up the connection, shipping the tasks, and retrieving the results takes more time than just running the jobs locally. You can force tasks to run remotely by zeroing out (or seriously limiting) the ncpus locally available to the pool. Even then, how many jobs run where depends on the automatic load balancing, which takes into account the number of locally available tasks and some measure of the time an individual job takes to complete versus the time it takes to connect and run the job remotely.
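
The suggestion to zero out the local ncpus could be sketched roughly as follows. This is a minimal sketch, not a tested recipe. It assumes that pathos.pp.ParallelPool (the pool type named in the answer, rather than the ProcessPool used in the question) accepts the same ncpus and servers attribute assignments shown in the question, and that a ppserver is already listening on every address passed in. The names slow_square, make_remote_pool and run_remotely are hypothetical helpers introduced only for illustration.

```python
def slow_square(x):
    # Import inside the function: a cautious choice so the task does not
    # rely on module-level names when it is shipped to another host.
    import time
    time.sleep(0.2)  # stand-in for real work; "quick" tasks tend to stay local
    return x * x


def make_remote_pool(servers):
    """Return a ParallelPool with no local workers (hypothetical helper)."""
    # Deferred import so this file can be loaded without pathos installed.
    from pathos.pp import ParallelPool

    pool = ParallelPool()
    pool.ncpus = 0          # zero out local workers, as the answer suggests
    pool.servers = servers  # assumption: 'host:port' of a running ppserver
    return pool


def run_remotely(servers, inputs):
    """Map slow_square over inputs using only the listed servers."""
    pool = make_remote_pool(servers)
    return pool.map(slow_square, inputs)
```

With the address from the question, a call might look like run_remotely(('ec2-18-223-23-82.us-east-2.compute.amazonaws.com:5653',), range(8)). If the SSH tunnel or the remote ppserver is not reachable, the call will not behave as intended, so it is worth confirming the connection first.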