I am using 3 machines for distributed TensorFlow (2 workers and 1 ps), all in the same cluster. I have placed my data on worker 1. The model trains fine, but it only uses the ps and one worker. My question is: how should the data be placed so that all of my workers can access it? Should I put it on shared storage such as HDFS?
    tf.reset_default_graph()

    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        # Between-graph replication: each worker builds its own copy of the
        # graph, with variables placed on the ps job.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=Cluster)):
            ## here defining my model, cost, optimizer

            sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                     global_step=global_step,
                                     init_op=init_op)
            with sv.prepare_or_wait_for_session(server.target) as sess:
                for epoch in range(training_epochs):
                    _, cost_val = sess.run([optimizer, cost],
                                           feed_dict={X: data})
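For reference, `Cluster`, `server`, and `FLAGS` in the snippet come from a standard setup along these lines (the hostnames and ports below are placeholders, not my real ones):

    import tensorflow as tf

    tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
    tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
    FLAGS = tf.app.flags.FLAGS

    # 1 ps and 2 workers, one task per machine (placeholder addresses).
    Cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })
    server = tf.train.Server(Cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)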
I found some relevant info in "GRPC causes training to pause in individual worker (distributed tensorflow, synchronised)"; it appears that we need to create TFRecords.
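If TFRecords on shared storage are indeed the way to go, I imagine each worker would read its own shard of the files along these lines (the hdfs:// path, shard count, and feature names are placeholders I made up, not my actual data):

    import tensorflow as tf

    # TFRecord files sit on storage every machine can reach
    # (e.g. an HDFS or NFS path).
    files = tf.data.Dataset.list_files("hdfs://namenode:8020/data/train-*.tfrecord")

    # Shard the file list by worker so the two workers read disjoint files.
    files = files.shard(num_shards=2, index=FLAGS.task_index)
    dataset = files.flat_map(tf.data.TFRecordDataset)

    def parse_example(serialized):
        # Placeholder feature spec: a 10-float feature vector and an int label.
        features = tf.parse_single_example(
            serialized,
            features={"x": tf.FixedLenFeature([10], tf.float32),
                      "y": tf.FixedLenFeature([], tf.int64)})
        return features["x"], features["y"]

    dataset = (dataset
               .map(parse_example)
               .shuffle(1000)
               .batch(32)
               .repeat())
    iterator = dataset.make_one_shot_iterator()
    batch_x, batch_y = iterator.get_next()  # feed these into the model instead of feed_dict

Is this the right approach, or is there a simpler way to make the data visible to both workers?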