Tags: python, tensorflow, tfrecord

Tensorflow - Read different block_lengths from multiple tfrecords with parallel_interleave?


I am trying to read three large tfrecord files of different lengths in parallel, like this:

files = [filename1, filename2, filename3]

data = tf.data.Dataset.from_tensor_slices(files)

data = data.apply(
    tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TFRecordDataset(filename),
        cycle_length=3, block_length=[10, 5, 3]))

data = data.shuffle(buffer_size=100)

data = data.apply(
    tf.contrib.data.map_and_batch(
        map_func=parse, 
        batch_size=100))

data = data.prefetch(10)

However, TensorFlow does not allow different block lengths per file source:

InvalidArgumentError: block_length must be a scalar
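
For reference, the pipeline above does run when block_length is a single scalar shared by all sources (a sketch under the same TF 1.x setup; 10 is an arbitrary value):

data = tf.data.Dataset.from_tensor_slices(files).apply(
    tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TFRecordDataset(filename),
        cycle_length=3,
        block_length=10))  # a scalar is accepted, but it applies to every file

A scalar, however, cannot express the 10/5/3 ratio I want between the three sources.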

I could create three separate datasets, each with its own mini-batch size, but that would require three times the resources, which is not an option given my machine's limitations.

What are the possible solutions?


Solution

  • Here is the answer; I figured out how to do it within my constraints.

    Make a dataset for each file, define a mini-batch size for each, and concatenate the get_next() outputs together (see the sketch below). This fits on my machine and runs efficiently.
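
    A minimal sketch of this approach, assuming TF 1.x, placeholder filenames, and a hypothetical parse() function that maps a serialized record to a single fixed-shape tensor:

    import tensorflow as tf

    files = [filename1, filename2, filename3]   # placeholder filenames
    batch_sizes = [10, 5, 3]                    # per-file mini-batch sizes (the old block lengths)

    iterators = []
    for filename, batch_size in zip(files, batch_sizes):
        dataset = tf.data.TFRecordDataset(filename)
        dataset = dataset.shuffle(buffer_size=100)
        dataset = dataset.apply(
            tf.contrib.data.map_and_batch(
                map_func=parse,           # hypothetical record parser
                batch_size=batch_size))
        dataset = dataset.prefetch(10)
        iterators.append(dataset.make_one_shot_iterator())

    # Concatenate the per-file mini-batches along the batch dimension, so each
    # session step yields 10 + 5 + 3 = 18 examples in the desired ratio.
    next_batch = tf.concat([it.get_next() for it in iterators], axis=0)

    Each iterator advances independently, so the per-file ratio is fixed by the batch sizes rather than by block_length.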