
Dataset directly from tf.train.SequenceExample


I'm working on an NER-like sequence tagging task in TensorFlow and decided to try tf.data to see if I can get IO performance improvements for my model.

At the moment I use a TFRecordWriter to preprocess and save my training/validation data as tf.train.SequenceExample protos serialized to strings. I then load them with tf.data.TFRecordDataset, parse/shuffle/padded_batch the dataset, and get on with training, which works fine.

Question is:

  • is there a convenient way to make the dataset without first serializing the SequenceExamples and saving them to a TFRecord file?
  • It seems like an unnecessary step, especially when I'll be running predictions on new data. I've tried playing with tf.data.Dataset.from_tensor_slices(), but it doesn't seem suitable here because the inputs are sequences of different lengths that haven't been padded yet.
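To illustrate the second point: tf.data.Dataset.from_tensor_slices() needs its input to convert to a rectangular tensor, so rows of different lengths are rejected outright. A minimal sketch (the sample sequences are made up; tf.ragged.constant is one TF 2.x workaround, at the cost of ragged elements downstream):

```python
import tensorflow as tf

# Hypothetical unpadded sequences of different lengths.
sequences = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]

# Direct conversion fails: the rows cannot form a rectangular tensor.
try:
    tf.data.Dataset.from_tensor_slices(sequences)
except (ValueError, tf.errors.InvalidArgumentError) as e:
    print("from_tensor_slices rejects ragged input:", type(e).__name__)

# Wrapping the data in a RaggedTensor sidesteps the rectangularity check.
dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(sequences))
```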

Solution

  • It may be possible to use tf.data.Dataset.from_generator() for this case. For example, let's say your examples look like the following very simple data, with two features (of which the second represents sequential data):

    examples = [("foo", [1, 2, 3, 4, 5]),
                ("bar", [6, 7]),
                ("baz", [8, 9, 10])]
    

    You could convert this to a tf.data.Dataset with the following code:

    import tensorflow as tf
    
    def example_generator():
      # Yield one (scalar string, variable-length int sequence) pair at a time.
      for string_feature, sequence_feature in examples:
        yield string_feature, sequence_feature
    
    dataset = tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.string, tf.int32),
        output_shapes=([], [None]),  # A scalar and a variable-length vector.
    )