python-3.x, tensorflow, tfrecord

Is there a way to get the size of a TFRecord file and the size of one Example in it?


I ask because I want to get the number of Examples in a TFRecord file. The method I currently use is

len([x for x in tf.python_io.tf_record_iterator(tf_record_file)])

but it is slow. All Examples in my TFRecord file have exactly the same length, so I wonder if there is a way to get the size (in bytes) of the whole TFRecord file (xxx.tfrecord) and the size (in bytes) of one Example in it. Then I think I can just use

number_of_Examples = (length of TFRecord file) / (length of the first Example) = (bytes of all Examples in xxx.tfrecord) / (bytes of one Example)

to get the number of Examples more quickly.


Solution

  • A TFRecord file is essentially an array of Examples, and it does not store the number of examples as metadata, so in general you have to iterate over the file to count them. Another option is to save the count as metadata at creation time, in a separate file (see the sidecar-file sketch at the end of this answer).

    Edit:

    The approach you propose won't work if two examples can have different sizes, which sometimes happens even when the number of features is identical.

    If it is guaranteed that all examples have exactly the same number of bytes, you could do the following:

    import os
    
    import tensorflow as tf
    
    def get_size(filename):
        """Return the file size in bytes."""
        return os.stat(filename).st_size
    
    file = "..."
    
    # Parse the first record to get the serialized size of one Example.
    example_size = 0
    example = tf.train.Example()
    for x in tf.python_io.tf_record_iterator(file):
        example.ParseFromString(x)
        example_size = example.ByteSize()
        break
    
    # Each record is framed with 16 bytes of overhead: an 8-byte length,
    # a 4-byte CRC of the length and a 4-byte CRC of the data.
    file_size = get_size(file)
    n = file_size // (example_size + 16)
    
    print("file size in bytes: {}".format(file_size))
    print("example size in bytes: {}".format(example_size))
    print("N: {}".format(n))