
Dataset directly from tf.train.SequenceExample


I'm working on an NER-like sequence tagging task in TensorFlow and decided to try tf.data to see if I can get IO performance improvements for my model.

At the moment I use a TFRecordWriter to preprocess and save my training/validation data as tf.train.SequenceExample protos serialized to strings. I then load them with tf.data.TFRecordDataset, parse/shuffle/padded_batch the dataset, and get on with training, which works fine.

Question is:

  • is there a convenient way to make the dataset without first serializing the SequenceExamples and saving them to a TFRecord file?
  • It seems like an unnecessary step, especially when I'll be running predictions on new data. I've tried playing with tf.data.Dataset.from_tensor_slices(), but it doesn't seem suitable here because the inputs are sequences of different lengths that haven't been padded yet.
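To illustrate the second point: tf.data.Dataset.from_tensor_slices() needs its input to convert to a rectangular tensor, so rows of different lengths are rejected outright. A minimal sketch (the sample sequences are made up; tf.ragged.constant is one TF 2.x workaround, at the cost of ragged elements downstream):

```python
import tensorflow as tf

# Hypothetical unpadded sequences of different lengths.
sequences = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]

# Direct conversion fails: the rows cannot form a rectangular tensor.
try:
    tf.data.Dataset.from_tensor_slices(sequences)
except (ValueError, tf.errors.InvalidArgumentError) as e:
    print("from_tensor_slices rejects ragged input:", type(e).__name__)

# Wrapping the data in a RaggedTensor sidesteps the rectangularity check.
dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(sequences))
```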

Solution

  • It may be possible to use tf.data.Dataset.from_generator() for this case. For example, let's say your examples look like the following very simple data, with two features (of which the second represents sequential data):

    examples = [("foo", [1, 2, 3, 4, 5]),
                ("bar", [6, 7]),
                ("baz", [8, 9, 10])]
    

    You could convert this to a tf.data.Dataset with the following code:

    import tensorflow as tf
    
    def example_generator():
      # Yield one (scalar string, variable-length int sequence) pair at a time.
      for string_feature, sequence_feature in examples:
        yield string_feature, sequence_feature
    
    dataset = tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.string, tf.int32),
        output_shapes=([], [None]),  # A scalar and a variable-length vector.
    )