What I need help with / What I was wondering
Hi, I am trying to load external tfrecord files with TFDS. I have read the official doc here, and I find that I need to define the feature structure using tfds.features. However, since the tfrecord files are already generated, I have no control over the generation pipeline. I do, however, know the tf.train.Example structure used with TFRecordWriter during generation, shown as follows.
from tensorflow.train import BytesList, Example, Feature, Features, Int64List

example = Example(features=Features(feature={
    'image': Feature(bytes_list=BytesList(value=[img_str])),    # img_str is JPEG-encoded image raw bytes
    'caption': Feature(bytes_list=BytesList(value=[caption])),  # caption is a UTF-8 encoded string
    'height': Feature(int64_list=Int64List(value=[height])),
    'width': Feature(int64_list=Int64List(value=[width])),
}))
The doc only describes how to translate tfds.features into the human-readable structure of the tf.train.Example. Nowhere does it mention the reverse: how to translate a tf.train.Example into tfds.features, which is needed to automatically add the proper metadata files with tfds.folder_dataset.write_metadata.
I wonder how to translate the above tf.train.Example into tfds.features. Thanks a lot!
BTW, while I understand that it is possible to read the data as-is from the TFRecord files with tf.data.TFRecordDataset and then use map(decode_fn) for decoding as suggested here, it seems to me this approach lacks necessary metadata like num_shards or shard_lengths. In that case, I am not sure if it is still OK to use common operations like cache/repeat/shuffle/map/batch on that tf.data.TFRecordDataset. So I think it is better to stick to the TFDS approach.
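For completeness, here is a minimal sketch of that manual approach, assuming the tf.train.Example structure shown above; the file pattern 'data-*.tfrecord' and the variable names are placeholders, not my actual setup.

import tensorflow as tf

# Keys and types mirror the tf.train.Example structure shown above.
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'caption': tf.io.FixedLenFeature([], tf.string),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
}

def decode_fn(record_bytes):
    parsed = tf.io.parse_single_example(record_bytes, feature_description)
    parsed['image'] = tf.io.decode_jpeg(parsed['image'])  # decode the raw JPEG bytes
    return parsed

ds = tf.data.TFRecordDataset(tf.io.gfile.glob('data-*.tfrecord'))  # placeholder pattern
ds = ds.map(decode_fn)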
What I've tried so far
I have searched the official doc for quite some time but cannot find the answer. There is a Scalar class in tfds.features, which I assume could be used to decode an Int64List. But how can I decode a BytesList?
Environment information
tensorflow-datasets version: 4.8.2
tensorflow version: 2.11.0

After some searching, I found that the simplest solution is:
import tensorflow as tf
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    'image': tfds.features.Image(),  # ideally, also pass the `shape=(h, w, c)` info if known
    'caption': tfds.features.Text(),
    'height': tf.int32,
    'width': tf.int32,
})
Then I can load the data either by writing the metadata with tfds.folder_dataset.write_metadata and loading through TFDS (a sketch is at the end of this post), or directly with:
ds = tf.data.TFRecordDataset(filenames)  # filenames: list of tfrecord file paths
ds = ds.map(features.deserialize_example)

batch, shuffle, etc. will work as expected in both cases.
The TFDS metadata would be helpful if fine-grained split control is needed.
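For the write_metadata route, here is a rough sketch, assuming the tfrecord shards already follow the TFDS file name template; the data_dir path and the split_infos numbers below are placeholders, not values from my actual dataset.

import tensorflow_datasets as tfds

data_dir = '/path/to/my_dataset/1.0.0/'  # placeholder: directory holding the tfrecord shards

# Write dataset_info.json / features.json next to the existing shards.
tfds.folder_dataset.write_metadata(
    data_dir=data_dir,
    features=features,  # the FeaturesDict defined above
    split_infos=[
        # Placeholder numbers; shard_lengths must match the real shard sizes.
        tfds.core.SplitInfo(name='train', shard_lengths=[1024], num_bytes=0),
    ],
)

# Afterwards the folder can be loaded like any prepared TFDS dataset.
builder = tfds.builder_from_directory(data_dir)
ds = builder.as_dataset(split='train')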