I am working on MXNet, a library for deep learning. My implemented structure is in both single machine and distributed CPU machines. I followed the tutorial on the MXNet official site. The implementation on single machine was running without any issue and I got the result.
Then I tried distributed training with multiple CPU machines. I created an account on AWS, amazon virtual machine, and launched 3 of t2.micro ubuntu. I typed the following command line:
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
This command line above suppose to run a distributed version training with 2 workers and 1 server.
Unfortunately, I got an error. I understand there is a permission denied, but I tried to access other two workers from server by the following command and it works:
ssh -i key.pem ubuntu@ip number.
Here is the error
Permission denied (publickey).
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
Permission denied (publickey).
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
[03:48:39] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Traceback (most recent call last):
File "train_mnist.py", line 76, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/home/ubuntu/Research/code/mxnet/example/image-classification/common/fit.py", line 97, in fit
kv = mx.kvstore.create(args.kv_store)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/kvstore.py", line 403, in create
ctypes.byref(handle)))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/base.py", line 77, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 363, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
You are seeing the error because the node where you launched the job does not have access to the other workers. the launch.py by default uses ssh to communicate with the workers. You need to have ssh agent forwarding setup from your local machine to the node(master), those credentials can be used to talk to ssh to other nodes in the cluster.
on MAC OS, -A would enable credential forwarding
ssh -A -i key.pem ubuntu@ip number.
You may to want to consider using the Deep Learning CloudFormation template that creates the stack and shows you steps to setup ssh-agent forwarding and CIFAR-10 example to run distributed training.