Tags: python, tensorflow, tfrecord

Integrating directory of TFRecord examples into model training


What is the most efficient way to feed data from multiple TFRecord files when training a TensorFlow model? With my current process, I iterate over the examples from the TFRecords and extract each batch into Python variables, but I don't believe this is the proper way to do it.

I am migrating from Keras to TensorFlow, hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecord files, and now I am trying to understand how to run basic linear regression models against a directory of TFRecord files. I have gotten to the point where I can read a TFRecord out into a Tensor and train in batches like so (the code is taken from the TensorFlow getting started example and then modified):

# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)

# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)


# Parses a scalar string `example_proto` into a pair of float32 tensors:
# the first element of the "X" feature and the first element of the "Y" label.
def _parse_function(example_proto):
    keys_to_features = {
        "X": tf.FixedLenFeature([40], tf.float32),
        "Y": tf.FixedLenFeature([10], tf.float32)
    }
    example = tf.parse_single_example(example_proto, keys_to_features)
    return example["X"][0], example["Y"][0]

filenames = tf.placeholder(tf.string, shape=[None])
dataset   = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset   = dataset.map(_parse_function)
dataset   = dataset.repeat()
dataset   = dataset.batch(1024)
iterator  = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # initialize the model parameters
sess.run(iterator.initializer, feed_dict = { filenames: training_filenames })
for i in range(10):
    x_train, y_train = sess.run(next_element)
    sess.run(train, {x: x_train, y: y_train})

My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with TensorFlow. In particular, what is the point of extracting the data from its binary form into a Python variable and then feeding it back into the training process? (the line below)

    x_train, y_train = sess.run(next_element)

I was under the impression there should be a way to feed the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other Stack Overflow posts, I am not finding anything.


Solution

  • The Dataset API is very versatile and flexible. It can be used to feed data through placeholders as you did. However, a better way is to incorporate the dataset within the graph and let it process everything at once.

    def model_function(input, label):
        # Model parameters, shaped from the feature dimensions of the
        # input and label tensors (the batch dimension is left out)
        W = tf.Variable(tf.zeros(input.shape.as_list()[1:]), dtype=tf.float32)
        b = tf.Variable(tf.zeros(label.shape.as_list()[1:]), dtype=tf.float32)

        # Model input and output
        x = input
        linear_model = W*x + b
        y = label

        # loss
        loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
        # optimizer
        optimizer = tf.train.GradientDescentOptimizer(0.1)
        train = optimizer.minimize(loss)

        return train
    
    
    ---<Previous dataset related code>---
    
    iterator = dataset.make_initializable_iterator()
    next_example, next_label = iterator.get_next()

    train_op = model_function(next_example, next_label)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

        for step in range(1000):
            _ = sess.run([train_op])
    

    In this way the dataset operations are part of the main graph, and the dataset's internal queuing and prefetching are used more effectively. Since only one sess.run call is needed per step, the overhead of the run function is minimised.
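
    As a small sketch beyond the answer above (assuming TensorFlow 1.4+, where Dataset.prefetch is available), appending prefetch to the input pipeline lets the next batch be prepared while the current training step runs:

    dataset   = tf.data.TFRecordDataset(filenames, "ZLIB")
    dataset   = dataset.map(_parse_function)
    dataset   = dataset.repeat()
    dataset   = dataset.batch(1024)
    dataset   = dataset.prefetch(1)  # keep one batch ready while the model trains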

    For more information, have a look at this part of the documentation: Importing data | TensorFlow 1.4

    If the training filenames are only known at run time, that placeholder has to be fed through feed_dict. However, I suggest against that. Filenames are rather static, so I would use a resources file such as config.py, place all the configuration properties in that file, and load the filenames at graph construction.

    To specify the filenames, there are two approaches. The first one:

    ...
    filenames = tf.constant(["filename1.tfrecords", "filename2.tfrecords"], dtype=tf.string)
    dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
    ...
    

    Or, a more proper approach would be to create a new directory in the main folder called resources, place an empty __init__.py file inside it along with another file called config.py. Inside config.py:

    --- inside config.py ---
    
    FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
    

    Inside the main tensorflow function where the dataset is being created:

    --- inside tensorflow file ---
    
    from resources import config
    
    ...
    filenames = tf.constant(config.FILENAMES, dtype=tf.string)
    dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
    ...
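
    As a side note, and only as a sketch beyond the answer above: once the filenames are plain constants there is no placeholder left to feed, so the initializable iterator (and its separate initializer run) can be swapped for a one-shot iterator:

    from resources import config

    # Build the pipeline entirely from constants, so no feed_dict is needed
    filenames = tf.constant(config.FILENAMES, dtype=tf.string)
    dataset   = tf.data.TFRecordDataset(filenames, "ZLIB")
    dataset   = dataset.map(_parse_function).repeat().batch(1024)

    # A one-shot iterator needs no explicit initialization before training
    iterator = dataset.make_one_shot_iterator()
    next_example, next_label = iterator.get_next()
    train_op = model_function(next_example, next_label)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for step in range(1000):
            _ = sess.run(train_op)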