TensorFlow provides three different formats for data to be stored in a tf.train.Feature. These are:
tf.train.BytesList
tf.train.FloatList
tf.train.Int64List
I often struggle to choose between tf.train.Int64List / tf.train.FloatList and tf.train.BytesList.
I see some examples online where they convert ints/floats into bytes and then store them in a tf.train.BytesList. Is this preferable to using one of the other formats? If so, why does TensorFlow even provide tf.train.Int64List and tf.train.FloatList as optional formats when you could just convert them to bytes and use tf.train.BytesList?
Thank you.
Because a bytes list requires more memory. It is designed to store string data or, for example, NumPy arrays converted to a single bytestring. Consider this example:
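from sys import getsizeof

import numpy as np
import tensorflow as tf  # uses the TF 1.x tf.python_io API

def int64_feature(value):
    # Wrap a single int (or a list of ints) in an Int64List feature.
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def float_feature(value):
    # Wrap a single float (or a list of floats) in a FloatList feature.
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def bytes_feature(value):
    # Wrap one bytestring in a BytesList feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Names chosen so the builtins bytes, int, float and str are not shadowed.
bytes_value = np.array(1.1).tobytes()  # tobytes() replaces the deprecated tostring()
int_value = 1
float_value = 1.1

# Swap in int64_feature(int_value) or bytes_feature(bytes_value) to compare.
writer = tf.python_io.TFRecordWriter('file.tfrecords')
example = tf.train.Example(features=tf.train.Features(feature={'1': float_feature(float_value)}))
writer.write(example.SerializeToString())
writer.close()

for str_rec in tf.python_io.tf_record_iterator('file.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(str_rec)
    # Read int64_list or bytes_list instead when testing the other features.
    value = example.features.feature['1'].float_list.value[0]
    print(getsizeof(value))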
For the float dtype it will output 24 bytes, the lowest value. However, you can't pass an int to a tf.train.FloatList. The int dtype will occupy 28 bytes in this case, while bytes will occupy 41 undecoded (before applying np.fromstring) and even more after.
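The same comparison can be sketched without TensorFlow, since the numbers above come from sys.getsizeof on the decoded Python objects. This is only an illustration under a few assumptions: it uses the standard-library struct module as a stand-in for np.array(...).tobytes(), the helper names object_sizes, pack_floats and unpack_floats are hypothetical, and the exact sizes are a CPython implementation detail that may differ on other interpreters or platforms.

```python
import struct
import sys

def object_sizes(value=1.1):
    """Return sys.getsizeof for a float, an int, and the float packed as raw bytes."""
    # 8-byte little-endian double; assumed stand-in for np.array(value).tobytes()
    packed = struct.pack("<d", value)
    return {
        "float": sys.getsizeof(float(value)),
        "int": sys.getsizeof(int(value)),
        "bytes": sys.getsizeof(packed),
        "payload": len(packed),  # the raw data itself, without object overhead
    }

def pack_floats(values):
    """Pack many floats into one bytestring, as one might before using a BytesList."""
    return struct.pack("<%dd" % len(values), *values)

def unpack_floats(blob):
    """Recover the floats; the reader must already know the element type and width."""
    count = len(blob) // 8
    return list(struct.unpack("<%dd" % count, blob))

sizes = object_sizes()
print(sizes)

blob = pack_floats([1.1, 2.2, 3.3])
print(len(blob), unpack_floats(blob))
```

The bytes object carries a fixed per-object overhead on top of its 8-byte payload, which is why a single value stored as bytes looks larger than the same value as a float. Packing many values into one bytestring spreads that overhead out, but the reader then has to know the dtype (and shape) to decode it, which the typed lists record for you.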