Tags: tensorflow, distributed, deep-learning

Failed to run TensorFlow distributed MNIST test


I installed TensorFlow 0.8 by building it from source. I am using an AWS EC2 g2.8xlarge instance, which has 4 GPUs. I tried to run the TensorFlow distributed MNIST test, whose code is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/scripts/dist_mnist_test.sh

The command I ran:

bash dist_mnist_test.sh "grpc://localhost:2223 grpc://localhost:2224"

and I got these errors:

E0609 14:53:07.430440599   62872 tcp_client_posix.c:173]     failed to connect to 'ipv4:127.0.0.1:2223': socket error: connection refused
E0609 14:53:07.445297934   62873 tcp_client_posix.c:173]     failed to connect to 'ipv4:127.0.0.1:2224': socket error: connection refused

Does anyone know what is wrong here? Thanks a lot!


Solution

  • This script does not run standalone. It expects that you have already created a TensorFlow cluster, with a worker running at each of the addresses you pass in, before you run it. The "connection refused" errors mean that nothing is listening on ports 2223 and 2224 of localhost. The create_tf_cluster.sh script can set up such a cluster using Kubernetes. The dist_test.sh script runs these scripts end-to-end.

    See my answer to your other question, which has a suggested script for running MNIST on distributed TensorFlow.
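
As a quick sanity check before launching dist_mnist_test.sh, you could confirm that something is actually listening at each worker address. The sketch below is not part of the TensorFlow scripts; it uses only the Python standard library, and the helper names (parse_grpc_url, is_listening, unreachable_workers) are made up for illustration. It assumes the addresses are given in the same space-separated "grpc://host:port" form as in the question. Note that it only tests whether a TCP connection can be opened, not whether the listener is really a TensorFlow gRPC server.

```python
import socket


def parse_grpc_url(url):
    """Split a 'grpc://host:port' string into (host, port)."""
    prefix = "grpc://"
    if not url.startswith(prefix):
        raise ValueError("expected a grpc:// URL, got %r" % url)
    host, _, port = url[len(prefix):].rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected grpc://host:port, got %r" % url)
    return host, int(port)


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers "connection refused", timeouts, and unresolvable hosts.
        return False
    sock.close()
    return True


def unreachable_workers(worker_urls, timeout=2.0):
    """Return the URLs (from a space-separated string) that refuse connections."""
    return [
        url
        for url in worker_urls.split()
        if not is_listening(*parse_grpc_url(url), timeout=timeout)
    ]


if __name__ == "__main__":
    # Same addresses as in the question; adjust to match your cluster.
    workers = "grpc://localhost:2223 grpc://localhost:2224"
    for url in unreachable_workers(workers):
        print("no server is listening at", url)
```

If this reports an address as unreachable, the MNIST test will fail with the same "connection refused" error, so the workers need to be started first.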