Tags: python, numpy, tensorflow, tensorflow-datasets, tfrecord

Serialize a tf.data.Dataset with Tensor features into a tfrecord file?


My Dataset looks like this:

dataset1 = tf.data.Dataset.from_tensor_slices((
  tf.random.uniform([4, 100], maxval=100, dtype=tf.int32),
  tf.random.uniform([4])))

for record in dataset1.take(2):
  print(record)
print(type(record))
(<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([28, 96,  6, 22, 36, 33, 34, 29, 20, 77, 40, 82, 45, 81, 62, 59, 30,
       86, 44, 17, 43, 32, 19, 32, 96, 24, 14, 65, 43, 59,  0, 96, 20, 17,
       54, 31, 88, 72, 88, 55, 57, 63, 92, 50, 95, 76, 99, 63, 95, 82, 22,
       36, 87, 56, 44, 29, 12, 45, 82, 27, 56, 32, 44, 66, 77, 99, 97, 58,
       52, 81, 42, 54, 78,  3, 29, 86, 59, 98, 67, 39, 25, 27, 16, 46, 68,
       81, 72, 30, 53, 95, 33, 71, 93, 82, 95, 55, 13, 53, 30, 21],
      dtype=int32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.42071342>)
(<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([71, 52,  9, 25, 94, 45, 64, 56, 99, 92, 62, 96, 13, 97, 39, 10, 27,
       41, 81, 37, 38, 20, 77, 11, 26, 28, 55, 99, 50,  7, 89,  2, 66, 64,
       11, 97,  4, 30, 34, 20, 81, 86, 68, 84, 75,  4, 22, 35, 87, 44, 57,
       94, 27, 19, 60, 37, 38, 83, 39, 75, 65, 80, 97, 72, 20, 69, 35, 20,
       37,  5, 60, 11, 84, 46, 25, 30, 13, 74,  5, 82, 34,  1, 79, 91, 41,
       83, 94, 80, 79,  6,  3, 26, 84, 20, 53, 78, 93, 36, 54, 44],
      dtype=int32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.73927164>)
<class 'tuple'>

So each record is a tuple of two Tensors: one is the input to a model and the other is its output. I am trying to write this Dataset to a .tfrecord file, which requires building a tf.train.Example from each record. Here is my attempt:

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def serialize_example(feature1, feature2):
  feature = {
    'feature1': _bytes_feature(tf.io.serialize_tensor(feature1)),
    'feature2': _float_feature(feature2),
  }

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

I expected dataset1.map(serialize_example) to work, so that I could then do

writer = tf.data.experimental.TFRecordWriter(some_path)
writer.write(dataset1)

However, I get the following error when I try dataset1.map(serialize_example):

...
value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
AttributeError: 'Tensor' object has no attribute 'numpy'

How should I convert this dataset into a .tfrecord file?


Solution

  • Following the TensorFlow TFRecord documentation, this is what I came up with:

    import tensorflow as tf
    
    dataset1 = tf.data.Dataset.from_tensor_slices((
      tf.random.uniform([4, 100], maxval=100, dtype=tf.int32),
      tf.random.uniform([4])))
    
    def _bytes_feature(value):
      """Returns a bytes_list from a string / byte."""
      if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
      return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    
    
    def _float_feature(value):
      """Returns a float_list from a float / double."""
      return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
    
    
    def serialize_example(feature1, feature2):
      feature = {
        'feature1': _bytes_feature(tf.io.serialize_tensor(feature1)),
        'feature2': _float_feature(feature2),
      }
    
      example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
      return example_proto.SerializeToString()
    
    def tf_serialize_example(f0,f1):
      tf_string = tf.py_function(
        serialize_example,
        (f0, f1),  # Pass these args to the above function.
        tf.string)      # The return type is `tf.string`.
      return tf.reshape(tf_string, ()) # The result is a scalar.
    
    dataset1 = dataset1.map(tf_serialize_example)
    writer = tf.data.experimental.TFRecordWriter('test.tfrecord')
    writer.write(dataset1)
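
To check that a file with this record layout is usable, it can be read back and decoded. The following is only a sketch: the temporary path, the parse_example helper and the feature_description dict are names made up for illustration, and the file is written with an eager loop over tf.io.TFRecordWriter (rather than the map-based writer above) just to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical path for this sketch; the answer writes to 'test.tfrecord'.
path = os.path.join(tempfile.mkdtemp(), "test.tfrecord")

inputs = tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)
targets = tf.random.uniform([4])
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))


def serialize_example(feature1, feature2):
    """Same layout as the answer: serialized tensor bytes plus one float."""
    feature = {
        "feature1": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(feature1).numpy()])),
        "feature2": tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(feature2.numpy())])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# Iterating eagerly yields EagerTensors, so .numpy() works here.
with tf.io.TFRecordWriter(path) as writer:
    for feature1, feature2 in dataset:
        writer.write(serialize_example(feature1, feature2))

# Describe how each record was stored so it can be parsed again.
feature_description = {
    "feature1": tf.io.FixedLenFeature([], tf.string),
    "feature2": tf.io.FixedLenFeature([], tf.float32),
}


def parse_example(record):
    parsed = tf.io.parse_single_example(record, feature_description)
    # Undo tf.io.serialize_tensor; the dtype must match what was written.
    feature1 = tf.io.parse_tensor(parsed["feature1"], out_type=tf.int32)
    feature1 = tf.ensure_shape(feature1, [100])  # parse_tensor loses the shape
    return feature1, parsed["feature2"]


restored = tf.data.TFRecordDataset(path).map(parse_example)
restored_inputs = np.stack([f1.numpy() for f1, _ in restored])
restored_targets = np.array([f2.numpy() for _, f2 in restored])
```

Because parse_example only uses TensorFlow ops, it can be passed to map directly without tf.py_function.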
    

    The key part is wrapping serialize_example in tf.py_function. serialize_example is plain Python rather than a composition of TensorFlow ops, and Dataset.map traces the mapped function in graph mode, where the arguments are symbolic Tensors that have no .numpy() method. This is what AttributeError: 'Tensor' object has no attribute 'numpy' was (albeit clumsily) trying to tell you. Inside tf.py_function the function runs eagerly, so its arguments are EagerTensors, which do have a .numpy() method.
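
The difference between the two kinds of tensor can be observed directly. This is a small sketch, not part of the original answer; the function names and the observed dict are made up for illustration.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(3))
observed = {}


def inspect_in_map(x):
    # Dataset.map traces this function into a graph, so x is symbolic here.
    observed["map"] = hasattr(x, "numpy")
    return x


def inspect_in_py_function(x):
    # tf.py_function runs this Python body eagerly, so x is an EagerTensor.
    observed["py_function"] = hasattr(x, "numpy")
    return x


def wrapped(x):
    return tf.py_function(inspect_in_py_function, [x], tf.int32)


# Tracing happens as soon as map is called.
dataset.map(inspect_in_map)

# The py_function body only runs when the dataset is iterated.
for _ in dataset.map(wrapped).take(1):
    pass

print(observed)
```

The symbolic tensor seen during tracing should report no numpy attribute, while the tensor seen inside tf.py_function should report one.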

    One more option: if the input does not have to be tf.int32, you could cast it to tf.int64 and store the values directly in an Int64List, using a function like this:

    def _int64_feature(value):
      """Returns an int64_list from a bool / enum / int / uint."""
      return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    

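    I haven't tried it. Note that tf.train.Feature is still a Python protobuf message that needs concrete values, so I would expect this function to need eager tensors as well (through tf.py_function or an eager loop), and for the 100-element input you would pass the values themselves rather than [value]. You could also cast to float32 and use a FloatList (tf.train.Feature has no float64 list), but that takes more storage: a FloatList uses 4 bytes per value, whereas Int64List values are varint-encoded, so small integers take less space.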
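
A possible version of the int64 variant is sketched below. It rests on several assumptions that are not in the original answer: the input is generated as tf.int64, the records are written with an eager loop over tf.io.TFRecordWriter, and the path and helper names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical output path, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "int64_features.tfrecord")

inputs = tf.random.uniform([4, 100], maxval=100, dtype=tf.int64)
targets = tf.random.uniform([4])
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))


def serialize_example(feature1, feature2):
    """Store the vector in an Int64List and the scalar in a FloatList."""
    feature = {
        # Pass the 100 values themselves, not [value]: the list must be flat.
        "feature1": tf.train.Feature(int64_list=tf.train.Int64List(
            value=feature1.numpy().tolist())),
        "feature2": tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(feature2.numpy())])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# Iterating the dataset eagerly yields EagerTensors, so .numpy() is available.
with tf.io.TFRecordWriter(path) as writer:
    for feature1, feature2 in dataset:
        writer.write(serialize_example(feature1, feature2))

# The fixed length (100) is declared here, so no parse_tensor step is needed.
feature_description = {
    "feature1": tf.io.FixedLenFeature([100], tf.int64),
    "feature2": tf.io.FixedLenFeature([], tf.float32),
}


def parse_example(record):
    parsed = tf.io.parse_single_example(record, feature_description)
    return parsed["feature1"], parsed["feature2"]


restored = tf.data.TFRecordDataset(path).map(parse_example)
restored_inputs = np.stack([f1.numpy() for f1, _ in restored])
```

With this layout the shape is recovered from the feature description at parse time, at the cost of fixing the vector length in advance.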