
When should one use tf.train.BytesList, tf.train.FloatList, and tf.train.Int64List for data to be stored in a tf.train.Feature?


TensorFlow provides three formats for data stored in a tf.train.Feature:

tf.train.BytesList
tf.train.FloatList
tf.train.Int64List

I often struggle to choose between tf.train.Int64List / tf.train.FloatList and tf.train.BytesList.

I see some examples online where they convert ints/floats into bytes and then store them in a tf.train.BytesList. Is this preferable to using one of the other formats? If so, why does TensorFlow even provide tf.train.Int64List and tf.train.FloatList as optional formats when you could just convert them to bytes and use tf.train.BytesList?

Thank you.


Solution

  • Because a bytes list requires more memory for numeric data. It is designed to store string data or, for example, a NumPy array serialized to a single byte string. Consider this example:

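    from sys import getsizeof

    import numpy as np
    import tensorflow as tf

    def int64_feature(value):
        # Int64List expects a list, so wrap a scalar
        if not isinstance(value, list):
            value = [value]
        return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

    def float_feature(value):
        # FloatList expects a list, so wrap a scalar
        if not isinstance(value, list):
            value = [value]
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))

    def bytes_feature(value):
        # BytesList stores each element as an opaque byte string
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    # The same kind of number in the three candidate representations
    # (names chosen so they no longer shadow the bytes/int/float builtins)
    bytes_value = np.array(1.1).tobytes()  # tostring() is deprecated
    int_value = 1
    float_value = 1.1

    # Swap in int64_feature(int_value) or bytes_feature(bytes_value) to compare
    example = tf.train.Example(features=tf.train.Features(feature={'1': float_feature(float_value)}))
    # tf.python_io is TF 1.x only; tf.io and tf.compat.v1.io are the TF 2.x equivalents
    with tf.io.TFRecordWriter('file.tfrecords') as writer:
        writer.write(example.SerializeToString())

    for str_rec in tf.compat.v1.io.tf_record_iterator('file.tfrecords'):
        example = tf.train.Example()
        example.ParseFromString(str_rec)
        # Use int64_list or bytes_list here to match the feature that was written
        value = example.features.feature['1'].float_list.value[0]
        print(getsizeof(value))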
    

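    With the float feature, getsizeof reports 24 bytes for the value read back, the smallest of the three. You can't pass an int to a tf.train.FloatList, though. Stored in a tf.train.Int64List, the int occupies 28 bytes, while the bytes version takes 41 bytes undecoded (before applying np.fromstring, or np.frombuffer in current NumPy) and even more once decoded. These figures are the sizes of the in-memory Python objects as reported by sys.getsizeof.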
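
The 24 / 28 / 41 figures can be reproduced without TensorFlow, since they come from the Python objects themselves. The following is a minimal sketch, assuming 64-bit CPython; `object_sizes` is a hypothetical helper name, and `struct.pack("d", ...)` stands in for `np.array(1.1).tobytes()` because it yields the same 8-byte float64 buffer.

```python
import struct
import sys


def object_sizes(number=1.1):
    """Return sys.getsizeof for a number held as a float, an int, and raw bytes."""
    as_float = float(number)
    as_int = int(number)
    # 8-byte native float64 payload, the same bytes np.array(number).tobytes() gives
    as_bytes = struct.pack("d", as_float)
    return {
        "float": sys.getsizeof(as_float),
        "int": sys.getsizeof(as_int),
        "bytes": sys.getsizeof(as_bytes),
        "payload": len(as_bytes),  # raw data length, excluding object overhead
    }


sizes = object_sizes()
print(sizes)
```

The payload is only 8 bytes; the rest of the bytes figure is the fixed overhead of a Python bytes object, which is why wrapping a single number in a byte string is the largest of the three here.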