I'm following an example here to learn distributed TF on MNIST. I changed the cluster config to:
parameter_servers = ["1.2.3.4:2222"]
workers = ["1.2.3.4:2222", "5.6.7.8:2222"]
1.2.3.4 and 5.6.7.8 are just placeholders for my two nodes, not their real IP addresses. The whole script is named example.py.
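For context, here is roughly how the example script consumes that config. This is only a sketch based on the standard distributed-TensorFlow tutorial pattern; the flag names are assumed to match the commands below:

import tensorflow as tf

# command-line flags, assumed to match the launch commands below
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = tf.app.flags.FLAGS

parameter_servers = ["1.2.3.4:2222"]
workers = ["1.2.3.4:2222", "5.6.7.8:2222"]

cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})

# each launched process starts the gRPC server for its own slot in the cluster
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()  # the parameter server blocks here and serves variables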
On 1.2.3.4, I ran python example.py --job_name=ps --task_index=0. Then, in a different terminal on the same machine, I ran python example.py --job_name=worker --task_index=0. It looks like it's just waiting.
On 5.6.7.8, I ran python example.py --job_name=worker --task_index=1. Immediately after that, I got the following error on 5.6.7.8:
tensorflow.python.framework.errors.UnavailableError: {"created":"@1480458325.580095889","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}
I tensorflow/core/distributed_runtime/master_session.cc:845] DeregisterGraph error: Aborted: Graph handle is not found: . Possibly, this worker just restarted.
And the following on 1.2.3.4:
tensorflow/core/distributed_runtime/graph_mgr.cc:55] 'unit.device' Must be non NULL
Aborted (core dumped)
Is this because I'm running both the parameter server and a worker on the same machine? I don't have more than two nodes, so how do I fix this?
So after a day I finally found the fix: change

workers = ["1.2.3.4:2222", "5.6.7.8:2222"]

to

workers = ["1.2.3.4:2223", "5.6.7.8:2222"]

Note the change in the first worker's port number: the parameter server and worker 0 both run on 1.2.3.4, and two tasks cannot listen on the same port of the same machine. That's everything that needs to be done.
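In other words, every task in the cluster spec starts its own gRPC server bound to its host:port, so two tasks on the same machine must use distinct ports. A sketch of the corrected spec with that constraint annotated:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["1.2.3.4:2222"],    # ps task 0
    "worker": ["1.2.3.4:2223",     # worker 0: same host as the ps task, so it needs its own port
               "5.6.7.8:2222"],    # worker 1: different host, so port 2222 can be reused
})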