python tensorflow distributed-computing grpc

Running distributed TensorFlow: UnavailableError: Endpoint read failed


I am new to TensorFlow and do not have much experience with it. I am now trying out distributed TensorFlow.

Following the official guide, I first create two servers by running the following code in two separate terminals:

import sys
import tensorflow as tf

task_number = int(sys.argv[1])

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)

print("Starting server #{}".format(task_number))

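server.start()  # no-op: the constructor starts the server by default (hence "Server already started" in the log)
server.join()   # block and keep serving requests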

The server starts successfully (output from task 0):

2018-01-25 20:05:37.651802: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
2018-01-25 20:05:37.652881: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
Starting server #0
2018-01-25 20:05:37.652938: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:328] Server already started (target: grpc://localhost:2222)

Then I run the following program:

import tensorflow as tf
x = tf.constant(2)

with tf.device("/job:local/task:1"):
    y2 = x - 66

with tf.device("/job:local/task:0"):
    y1 = x + 300
    y = y1 + y2

with tf.Session("grpc://localhost:2223") as sess:
    result = sess.run(y)
    print(result)

It fails with the following error message:

E0125 20:05:49.573488650   10292 ev_epoll1_linux.c:1051]     grpc epoll fd: 5
Traceback (most recent call last):
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/****/Documents/intern/sample_data/try.py", line 25, in <module>
    result = sess.run(y)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed

I searched for the error, and some answers suggest that a proxy might be the cause, so I disabled the proxy, but nothing changed.

Does anyone have any idea what the problem might be? Many thanks in advance.


Solution

  • Never mind, problem solved. It was the proxy setting after all. The proxy has to be unset on both the servers and the client, not just one of them, for the program to work.
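
If the proxy is configured through environment variables, one way to unset it from inside Python is sketched below. This is only an illustration, not part of the original answer: the helper name clear_proxy_env and the list of variable names are assumptions. gRPC is known to read http_proxy; the other names are included as a precaution and may not matter for your setup.

```python
import os

# Assumed list of proxy variables; adjust it to whatever your shell actually sets.
PROXY_VARS = ("http_proxy", "https_proxy", "all_proxy", "grpc_proxy")


def clear_proxy_env(env=None):
    """Remove proxy variables (lower- and upper-case) from env.

    env defaults to os.environ. Returns the names that were removed.
    """
    if env is None:
        env = os.environ
    removed = []
    for name in PROXY_VARS:
        for key in (name, name.upper()):
            if key in env:
                del env[key]
                removed.append(key)
    return removed
```

Call clear_proxy_env() at the top of both the server script and the client script, before tf.train.Server or tf.Session is created, so that every process involved connects without the proxy. Running unset http_proxy (and any other proxy variables you have set) in each terminal before launching the scripts should have the same effect.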