I'm working on a NER-like sequence tagging task in TensorFlow and decided to try tf.data to see if I can get IO performance improvements for my model. At the moment I am using TFRecordWriter to preprocess and save my training/validation data, where each example is a tf.train.SequenceExample() serialized to a string. I then load it with tf.data.TFRecordDataset, parse/shuffle/padded_batch it, and get on with training, which works fine.
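For concreteness, my current pipeline looks roughly like the sketch below. The feature names ("tokens", "labels"), the file name train.tfrecord, and the batch/buffer sizes are placeholders for illustration, not my real setup:

import tensorflow as tf

def make_sequence_example(tokens, labels):
    # Build a tf.train.SequenceExample with two aligned feature lists.
    ex = tf.train.SequenceExample()
    for token, label in zip(tokens, labels):
        ex.feature_lists.feature_list["tokens"].feature.add().bytes_list.value.append(
            token.encode())
        ex.feature_lists.feature_list["labels"].feature.add().int64_list.value.append(
            label)
    return ex

# Preprocess and save: one serialized SequenceExample per record.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for tokens, labels in [(["John", "lives", "here"], [1, 0, 0])]:
        writer.write(make_sequence_example(tokens, labels).SerializeToString())

def parse(serialized):
    # Recover the variable-length feature lists from the serialized proto.
    _, sequence = tf.io.parse_single_sequence_example(
        serialized,
        sequence_features={
            "tokens": tf.io.FixedLenSequenceFeature([], tf.string),
            "labels": tf.io.FixedLenSequenceFeature([], tf.int64),
        })
    return sequence["tokens"], sequence["labels"]

# Load, parse, shuffle, and pad each batch to its longest sequence.
dataset = (tf.data.TFRecordDataset("train.tfrecord")
           .map(parse)
           .shuffle(buffer_size=1000)
           .padded_batch(32, padded_shapes=([None], [None])))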
The question is: can I build the dataset directly from the in-memory SequenceExamples, without first serializing and saving them to a tfrecord file? I have looked at tf.data.Dataset.from_tensor_slices(), but it does not seem suitable in this scenario, because the inputs are sequences of different lengths that have not yet been padded.

It may be possible to use tf.data.Dataset.from_generator() for this case. For example, let's say your examples look like the following very simple data, with two features (of which the second represents sequential data):
examples = [("foo", [1, 2, 3, 4, 5]),
            ("bar", [6, 7]),
            ("baz", [8, 9, 10])]
You could convert this to a tf.data.Dataset with the following code:
def example_generator():
    for string_feature, sequence_feature in examples:
        yield string_feature, sequence_feature

dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_types=(tf.string, tf.int32),
    output_shapes=([], [None]),  # A scalar and a variable-length vector.
)
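From there you can shuffle and pad the data just as in your TFRecord-based pipeline, since the dataset elements have the same structure. A minimal sketch (the batch size and shuffle buffer size are arbitrary):

# Pad each batch's variable-length vectors to the longest sequence in
# the batch; the scalar string component needs no padding.
batched = (dataset
           .shuffle(buffer_size=3)
           .padded_batch(2, padded_shapes=([], [None])))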