I am applying machine learning to MNIST with TensorFlow. I do this on a cluster where every node runs a distributed execution of TensorFlow. I launch the individual executions via a bash script on a master node. The master node connects to a set of nodes in the cluster over ssh and then starts the Python script that runs TensorFlow on each of them.
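For reference, the master script is essentially an ssh fan-out loop. A minimal sketch of the idea (the hostnames are placeholders, and I have left out the distributed-TensorFlow flags the real script passes):

    #!/bin/bash
    # Hypothetical sketch of the master launch script: start the
    # TensorFlow job on each node over ssh, in the background.
    HOSTS="node01 node02 node03"   # placeholder hostnames
    for host in $HOSTS; do
        ssh "$host" "python /home/mvo010/inf3203-1/mnist_softmax.py" &
    done
    wait   # block until every remote job finishes or is killed on timeout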
While TensorFlow is running on the nodes, I often get the following error, which causes a node to crash:
2017-03-29 20:34:02.749498: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:239] Started server with target: grpc://localhost:8338
Extracting /home/mvo010/.tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /home/mvo010/.tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /home/mvo010/.tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /home/mvo010/.tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
  File "/home/mvo010/inf3203-1/mnist_softmax.py", line 173, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/share/apps/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/mvo010/inf3203-1/mnist_softmax.py", line 24, in main
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
  File "/share/apps/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 256, in read_data_sets
    train = DataSet(train_images, train_labels, dtype=dtype, reshape=reshape)
  File "/share/apps/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 138, in __init__
    images = numpy.multiply(images, 1.0 / 255.0)
MemoryError
This is caused by low memory. When I log in on a node to check, the free memory is indeed very low. The problem is that memory on a node is not freed once a run finishes (or once the master bash script kills it on a timeout); presumably, leftover Python processes from earlier runs keep holding it.
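A quick way to see whether that is the case (the hostname below is a placeholder):

    # Check free memory and list any leftover Python processes on a node.
    ssh node01 'free -m; pgrep -af python'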
Is there a trivial way to clean up memory on a node after the TensorFlow application has quit? I do not have any sudo permissions.
I got some inspiration from the AWS example on distributed deep learning (https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/). Running

    pkill -f python

on each worker and parameter server host before starting the Python script solved this problem. It kills any Python processes left over from previous runs, which frees the memory they were holding, and it needs no sudo permissions because the leftover processes belong to my own user.
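In the master script this amounts to one extra ssh command per host before the launch. A minimal sketch, again with placeholder hostnames:

    #!/bin/bash
    # Hypothetical sketch: clean up leftover Python processes on every
    # host before launching the next run.
    HOSTS="node01 node02 node03"   # placeholder hostnames
    for host in $HOSTS; do
        # The [p] regex trick keeps the pattern from matching the remote
        # shell's own command line (which also contains the word "python").
        # "|| true" because pkill exits non-zero when nothing matched.
        ssh "$host" "pkill -f '[p]ython' || true"
    done
    # ...then start the TensorFlow jobs as before.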