TensorFlow: Count the number of examples in a TFRecord file -- without using the deprecated `tf.python_io.tf_record_iterator`


Please read the post before marking it as a duplicate:

I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to calculate this information.
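To make the "loop through the file" point concrete, here is a minimal sketch that counts records by walking the on-disk framing directly. It relies on the documented TFRecord layout (an 8-byte little-endian length, a 4-byte checksum of the length, the payload, then a 4-byte checksum of the payload). It assumes an uncompressed file and does not verify checksums. The function name `count_records_by_framing` and the hand-built demo file are illustrative only, not part of TensorFlow.

```python
import os
import struct
import tempfile


def count_records_by_framing(path):
    """Count records by hopping over length headers without decoding payloads.

    Assumes an uncompressed TFRecord file; checksums are skipped, not verified.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            # 8-byte little-endian payload length + 4-byte checksum of that length
            header = f.read(12)
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            # Skip the payload and its trailing 4-byte checksum.
            f.seek(length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("truncated record payload")
            count += 1
    return count


# Demo on a hand-built file: three records with zeroed checksums.
# (Fine for this counter, which ignores checksums; a real reader would reject them.)
demo_path = os.path.join(tempfile.mkdtemp(), "framing_demo.tfrecord")
with open(demo_path, "wb") as f:
    for payload in (b"a", b"bcd", b""):
        f.write(struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4)

framing_count = count_records_by_framing(demo_path)
print(framing_count)
```

Even this shortcut has to visit every record header, since there is no count stored anywhere in the file.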

Several existing questions on Stack Overflow address this. The problem is that all of them seem to use the deprecated `tf.python_io.tf_record_iterator` function, so none of them is a stable solution. Here is a sample of the existing posts:

Obtaining total number of records from .tfrecords file in Tensorflow

Number of examples in each tfrecord

So I was wondering if there was a way to count the number of records using the new Dataset API.


Solution

  • The `tf.data.Dataset` class has a `reduce` method, and its documentation gives an example of counting records with it:

    import numpy as np
    import tensorflow as tf

    # Generate the dataset. Batch size and repeat must be 1; it is probably
    # safest to avoid transformations such as map and shard before counting.
    ds = tf.data.Dataset.range(5)
    # Count the examples: start at 0 and add 1 for every element.
    cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)

    # produces 5
    

    I don't know whether this method is faster than @krishnab's for loop.
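The same `reduce` pattern should carry over to an actual TFRecord file via `tf.data.TFRecordDataset`. Below is a minimal sketch, assuming TensorFlow 2.x with eager execution (under TF 1.x the result would have to be evaluated in a session). The helper name `count_tfrecord_examples`, the `"id"` feature, and the temporary demo file are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf


def count_tfrecord_examples(path):
    """Count records in a TFRecord file by reducing over the raw dataset."""
    # No map/batch/repeat: each dataset element is exactly one serialized record.
    ds = tf.data.TFRecordDataset(path)
    count = ds.reduce(tf.constant(0, dtype=tf.int64), lambda acc, _: acc + 1)
    # Assumes eager execution, where the result is a concrete scalar tensor.
    return int(count.numpy())


# Hypothetical demo file so the sketch is self-contained: seven tiny examples.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(demo_path) as writer:
    for i in range(7):
        example = tf.train.Example(features=tf.train.Features(feature={
            "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

num_examples = count_tfrecord_examples(demo_path)
print(num_examples)
```

The records are never parsed here, only counted, so no feature description is needed. Whether this beats a plain Python loop over the dataset would have to be measured on the real files.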