I am new to TensorFlow and am now trying out distributed TensorFlow.
Following the official guide, I first create two servers by running the following code in two separate terminals, passing 0 and 1 as the task number:
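import sys
import tensorflow as tf
# Task index (0 or 1) is passed on the command line.
task_number = int(sys.argv[1])
# One job named "local" with two tasks, both on this machine.
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
print("Starting server #{}".format(task_number))
# tf.train.Server already starts on construction (start=True by default),
# so this call only produces the "Server already started" log line.
server.start()
# Block forever so the server keeps serving requests.
server.join()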
The servers start successfully. This is the output from server #0:
2018-01-25 20:05:37.651802: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
2018-01-25 20:05:37.652881: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
Starting server #0
2018-01-25 20:05:37.652938: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:328] Server already started (target: grpc://localhost:2222)
Then I run the following client program:
import tensorflow as tf
x = tf.constant(2)
with tf.device("/job:local/task:1"):
    y2 = x - 66
with tf.device("/job:local/task:0"):
    y1 = x + 300
    y = y1 + y2
with tf.Session("grpc://localhost:2223") as sess:
    result = sess.run(y)
    print(result)
It fails with the following error message:
E0125 20:05:49.573488650 10292 ev_epoll1_linux.c:1051] grpc epoll fd: 5
Traceback (most recent call last):
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/****/Documents/intern/sample_data/try.py", line 25, in <module>
    result = sess.run(y)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed
I googled the error, and some people suggest it might be a proxy problem, so I disabled the proxy, but nothing changed.
Does anyone have an idea what the problem might be? Many thanks in advance.
Never mind, the problem is solved. It was the proxy setting after all: the proxy has to be unset for both the servers and the client, not just one of them, for the program to work.
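For anyone who hits the same error, here is a minimal sketch of what "unsetting the proxy" can look like from inside Python. It assumes the proxy is configured through the usual environment variables; the list in PROXY_VARS and the helper name clear_proxy_env are my own illustration, not part of TensorFlow, so adjust them to whatever your environment actually sets.

```python
import os

# Assumed set of proxy variables; extend it if your setup uses others.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_env(environ=os.environ):
    """Remove proxy variables from `environ` and return the names removed."""
    removed = []
    for name in PROXY_VARS:
        if name in environ:
            del environ[name]
            removed.append(name)
    return removed


# Call this at the top of BOTH the server script and the client script,
# before tf.train.Server(...) or tf.Session(...) is created.
clear_proxy_env()
```

Running `unset http_proxy https_proxy` in each terminal before launching the scripts should have the same effect. Either way, it has to be done in every process, servers and client alike.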