Tags: python, tensorflow, distributed-computing, distributed

Distributed Tensorflow not working with simple example


I'm following an example here to learn distributed TF on MNIST. I changed the cluster config to:

parameter_servers = ["1.2.3.4:2222"]
workers = [ "1.2.3.4:2222", "5.6.7.8:2222"]

1.2.3.4 and 5.6.7.8 are just placeholders for my two nodes; they are not the real IP addresses. The whole script is saved as example.py.
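
For context, the cluster wiring in example.py looks roughly like this (a minimal sketch of the blog example's setup; --job_name and --task_index are the flags I actually pass, the rest of the script may differ slightly):

import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "index of the task within its job")
FLAGS = tf.app.flags.FLAGS

parameter_servers = ["1.2.3.4:2222"]
workers = ["1.2.3.4:2222", "5.6.7.8:2222"]

cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()  # the parameter server just blocks and serves variables
elif FLAGS.job_name == "worker":
    # variables get pinned to the ps job, ops to this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        pass  # MNIST model and training loop from the blog post go here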

On 1.2.3.4, I ran python example.py --job_name=ps --task_index=0. Then, on the same machine, I ran python example.py --job_name=worker --task_index=0 in a different terminal. It looks like it's just waiting.

On 5.6.7.8, I ran python example.py --job_name=worker --task_index=1. After that I immediately got the following error on 5.6.7.8:

tensorflow.python.framework.errors.UnavailableError: {"created":"@1480458325.580095889","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}
I tensorflow/core/distributed_runtime/master_session.cc:845] DeregisterGraph error: Aborted: Graph handle is not found: . Possibly, this worker just restarted.

And

tensorflow/core/distributed_runtime/graph_mgr.cc:55] 'unit.device' Must be non NULL
Aborted (core dumped)

on 1.2.3.4

Is this because I'm running both the parameter server and a worker on the same machine? I don't have more than two nodes, so how do I fix this?


Solution

  • So after a day I finally got the fix:

    1. Do as Yaroslav suggests for the param server, so that the worker doesn't run out of GPU memory
    2. The param server and a worker cannot listen on the same port (as they do in the original post), so change workers = [ "1.2.3.4:2222", "5.6.7.8:2222"] to workers = [ "1.2.3.4:2223", "5.6.7.8:2222"]. Note the change in port number for worker 0 (see the sketch after this list).

    That's everything that needs to be done.
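
    Putting the two fixes together, the corrected setup looks roughly like this. This is a minimal sketch: I'm assuming Yaroslav's suggestion amounts to keeping the parameter-server process off the GPU (for example by hiding the GPU from it with CUDA_VISIBLE_DEVICES=""), which is one common way to stop it from claiming the memory the worker needs; the exact hint from the original thread isn't reproduced here.

    parameter_servers = ["1.2.3.4:2222"]
    workers = ["1.2.3.4:2223", "5.6.7.8:2222"]  # worker 0 moved to port 2223 so it no longer collides with the ps

    # on 1.2.3.4, terminal 1 -- parameter server, GPU hidden so the worker keeps the GPU memory (my assumption of Yaroslav's hint):
    #   CUDA_VISIBLE_DEVICES="" python example.py --job_name=ps --task_index=0
    # on 1.2.3.4, terminal 2 -- worker 0:
    #   python example.py --job_name=worker --task_index=0
    # on 5.6.7.8 -- worker 1:
    #   python example.py --job_name=worker --task_index=1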