python · machine-learning · tensorflow · distributed-computing

How to enable distributed reading and consuming in a Dataset pipeline


It is easy to use two threads, where one keeps feeding data to a queue and the other consumes data from the queue and performs the computation (see the sketch after the list below). Since TensorFlow has recommended Dataset as the input pipeline since 1.2.0, I would like to use a Dataset and its iterator to accomplish the task above, namely:

  1. There are two processes, one that feeds and one that consumes;
  2. The pipeline blocks when it is either full or empty, and it stops when the consuming side finishes its computation.
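
For reference, here is a minimal sketch of the thread-based pattern described above, using TF 1.x graph-mode APIs; the queue capacity, element shape, made-up data, and the trivial "computation" are assumptions for illustration only:

    import threading

    import numpy as np
    import tensorflow as tf

    # Bounded queue: enqueue blocks when the queue is full, dequeue blocks when it is empty.
    q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[]])
    value_ph = tf.placeholder(tf.float32, shape=[])
    enqueue_op = q.enqueue(value_ph)
    result = q.dequeue() * 2.0  # stand-in for the real computation

    with tf.Session() as sess:
        def feed():
            # Feeding thread: blocks whenever the queue is full.
            for x in np.arange(100, dtype=np.float32):
                sess.run(enqueue_op, feed_dict={value_ph: x})

        feeder = threading.Thread(target=feed)
        feeder.start()
        # Consuming side (main thread): blocks whenever the queue is empty.
        for _ in range(100):
            sess.run(result)
        feeder.join()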

P.S. Why does TensorFlow use threads instead of processes in its Threading and Queues tutorial?

Thank you in advance.


Solution

  • Distributed tf.contrib.data pipelines are not yet supported as of TensorFlow 1.3. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.

    In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows:

    import tensorflow as tf
    
    # The "..." stands for the queue's constructor arguments (capacity, dtypes, shapes).
    q = tf.FIFOQueue(...)
    
    # Define a dummy dataset that contains the same value repeated indefinitely.
    dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
    
    # Each element of this dataset is produced by dequeuing from q.
    dataset_from_queue = dummy.map(lambda _: q.dequeue())
    

    You can then compose other Dataset transformations with dataset_from_queue.
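
    For example, a rough sketch of batching the queue-backed dataset and driving the queue from a feeder thread; the element dtype/shape, batch size, and feeder data below are illustrative assumptions, not part of the original answer:

    import threading

    import numpy as np
    import tensorflow as tf

    q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[]])
    value_ph = tf.placeholder(tf.float32, shape=[])
    enqueue_op = q.enqueue(value_ph)

    dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
    dataset_from_queue = dummy.map(lambda _: q.dequeue())

    # Ordinary Dataset transformations compose with the queue-backed dataset.
    batched = dataset_from_queue.batch(4)
    iterator = batched.make_initializable_iterator()
    next_batch = iterator.get_next()

    with tf.Session() as sess:
        sess.run(iterator.initializer)

        def feed():
            # Feeding thread: blocks whenever the queue is full.
            for x in np.arange(20, dtype=np.float32):
                sess.run(enqueue_op, feed_dict={value_ph: x})

        feeder = threading.Thread(target=feed)
        feeder.start()
        for _ in range(5):
            print(sess.run(next_batch))  # each run yields a batch of 4 dequeued values
        feeder.join()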