I am trying to read three different large TFRecord files of different lengths in parallel, like this:
files = [filename1, filename2, filename3]
data = tf.data.Dataset.from_tensor_slices(files)
data = data.apply(
    tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TFRecordDataset(filename),
        cycle_length=3, block_length=[10, 5, 3]))
data = data.shuffle(buffer_size=100)
data = data.apply(
    tf.contrib.data.map_and_batch(
        map_func=parse,
        batch_size=100))
data = data.prefetch(10)
However, TensorFlow does not allow a different block_length per file source:
InvalidArgumentError: block_length must be a scalar
I could create three separate datasets, each with its own mini-batch size, but that would take 3x the resources, which is not an option given my machine's limitations.
What are the possible solutions?
Here is the answer; I figured out how to do it within my constraints.
Make a dataset for each file, define a mini-batch size for each, and concatenate the get_next() outputs together, as in the sketch below. This fits on my machine and runs efficiently.
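A minimal sketch of that approach, using the TensorFlow 1.x API to match the tf.contrib.data code above. The per-file batch sizes of 10, 5, and 3 are only illustrative, and it assumes parse maps one serialized record to a single fixed-shape tensor; if parse returns a tuple or dict, each component would need to be concatenated separately.

import tensorflow as tf

files = [filename1, filename2, filename3]
batch_sizes = [10, 5, 3]  # one mini-batch size per file (illustrative values)

batch_tensors = []
for filename, batch_size in zip(files, batch_sizes):
    ds = tf.data.TFRecordDataset(filename)
    ds = ds.shuffle(buffer_size=100)
    ds = ds.map(parse)
    ds = ds.batch(batch_size)
    ds = ds.repeat()   # assumption: loop shorter files so no source runs out early
    ds = ds.prefetch(1)
    batch_tensors.append(ds.make_one_shot_iterator().get_next())

# Merge the per-file mini-batches into one combined batch (10 + 5 + 3 = 18 examples).
next_batch = tf.concat(batch_tensors, axis=0)

with tf.Session() as sess:
    examples = sess.run(next_batch)

Each get_next() call yields one mini-batch per file, and tf.concat merges them, so every combined batch contains a fixed proportion of examples from each file.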