I want to count the number of Examples in a TFRecord file. The method I currently use is
len([x for x in tf.python_io.tf_record_iterator(tf_record_file)])
but it is slow. All Examples in my TFRecord file have exactly the same length, so I wonder if there is a way to get the size (in bytes) of the whole TFRecord file (xxx.tfrecord) and the size (in bytes) of one Example in it. Then I could just compute
number_of_Examples = (bytes of the whole xxx.tfrecord file) / (bytes of one Example)
to get the number of Examples more quickly.
A TFRecord file is essentially an array of Examples, and it does not store the number of examples as metadata. Thus, one must iterate over it to count them. Another option is to save the count as metadata at creation time (in a separate file).
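For example, a minimal sketch of the sidecar-file idea (the helper functions and the .count suffix are my own conventions, not a TensorFlow API):

import tensorflow as tf

def write_with_count(examples, tfrecord_path):
    # Write all examples and remember how many there were.
    count = 0
    with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
        for example in examples:
            writer.write(example.SerializeToString())
            count += 1
    # Store the count in a sidecar file next to the TFRecord.
    with open(tfrecord_path + ".count", "w") as f:
        f.write(str(count))

def read_count(tfrecord_path):
    # Reading the count back never touches the TFRecord itself.
    with open(tfrecord_path + ".count") as f:
        return int(f.read())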
The approach you propose won't work if any two examples can differ in size, which is sometimes the case even when the number of features is identical (e.g., variable-length bytes or list features).
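To see why, here is a quick illustration (the feature name and values are made up): two Examples with the same single feature can still serialize to different sizes.

import tensorflow as tf

def make_example(value):
    # One Example with a single bytes feature named "text".
    return tf.train.Example(features=tf.train.Features(feature={
        "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    }))

a = make_example(b"short")
b = make_example(b"a much longer string value")
print(a.ByteSize(), b.ByteSize())  # same feature count, different byte sizes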
If it is guaranteed that all examples have exactly the same number of bytes, you could do the following:
import os
import tensorflow as tf

def get_size(filename):
    return os.stat(filename).st_size

tfrecord_file = "..."

# Parse the first Example to get the payload size of a single record.
example_size = 0
example = tf.train.Example()
for x in tf.python_io.tf_record_iterator(tfrecord_file):
    example.ParseFromString(x)
    example_size = example.ByteSize()
    break

file_size = get_size(tfrecord_file)

# Each record adds 16 bytes of framing to its payload: an 8-byte length,
# a 4-byte CRC of the length, and a 4-byte CRC of the data.
n = file_size // (example_size + 16)

print("file size in bytes: {}".format(file_size))
print("example size in bytes: {}".format(example_size))
print("N: {}".format(n))