
Data changing between calls in Tensorflow


I made a small change to the TensorFlow MNIST tutorial. Original code (fully_connected_feed.py, lines 194-202):

checkpoint_file = os.path.join(FLAGS.log_dir, 'model.ckpt')
saver.save(sess, checkpoint_file, global_step=global_step)
# Evaluate against the training set.
print('Training Data Eval:')
do_eval(sess, 
        eval_correct, 
        images_placeholder,
        labels_placeholder,
        data_sets.train)

I simply added one more evaluation:

checkpoint_file = os.path.join(FLAGS.log_dir, 'model.ckpt')
saver.save(sess, checkpoint_file, global_step=global_step)
print('Something strange:')
do_eval(sess, eval_correct, images_placeholder, labels_placeholder,
        data_sets.train)
# Evaluate against the training set.
print('Training Data Eval:')
do_eval(sess, 
        eval_correct, 
        images_placeholder,
        labels_placeholder,
        data_sets.train)

The results of these evaluations are close, but not the same (the numbers vary from run to run):

Something strange:
  Num examples: 55000  Num correct: 49218  Precision @ 1: 0.8949
Training Data Eval:
  Num examples: 55000  Num correct: 49324  Precision @ 1: 0.8968

How is this possible? UPD: added a link to the TensorFlow GitHub repository: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/mnist


Solution

  • The do_eval() function in fact does have a side effect, because data_sets.train is a stateful DataSet object that contains a current _index_in_epoch member, which is advanced on each call to DataSet.next_batch() (i.e. in fill_feed_dict()).

    On its own, this fact shouldn't be enough to give non-deterministic results, but there are two other details about DataSet.next_batch() that lead to the non-determinism:

    1. Every time a new epoch is started, the examples are randomly shuffled.

    2. When the data set reaches the end of an epoch, it resets to the start and the last num_examples % batch_size examples of that epoch are discarded. Because of the random shuffling, a different subset of examples is dropped each time, which leads to the non-deterministic results (see the sketch below).

    Given the way the code is structured (with the DataSet shared between training and testing), it's tricky to make the code deterministic. The DataSet class is sparsely documented, but this behavior is surprising, so I'd consider filing a GitHub issue about this problem.
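
    To see why the stateful index and the per-epoch reshuffle can make two back-to-back evaluations disagree, here is a minimal, self-contained sketch. ToyDataSet is a hypothetical stand-in that reproduces only the behavior described above (it is not the real input_data.DataSet code), and evaluate() plays the role of do_eval()/fill_feed_dict():

    import numpy as np

    class ToyDataSet:
        """Hypothetical, stripped-down stand-in for the tutorial's DataSet.
        It keeps only the two behaviours discussed above: a persistent
        _index_in_epoch, and a reshuffle (with any partial leftover batch
        dropped) whenever a new epoch begins."""

        def __init__(self, num_examples):
            self.num_examples = num_examples
            self._ids = np.arange(num_examples)   # stand-ins for the images
            self._index_in_epoch = 0
            np.random.shuffle(self._ids)

        def next_batch(self, batch_size):
            start = self._index_in_epoch
            if start + batch_size > self.num_examples:
                # New epoch: reshuffle and restart from 0; the last
                # num_examples % batch_size examples of the old epoch are dropped.
                np.random.shuffle(self._ids)
                start = 0
            self._index_in_epoch = start + batch_size
            return self._ids[start:self._index_in_epoch]

    def evaluate(data_set, batch_size=100):
        """Mimics do_eval()/fill_feed_dict(): one pass of
        num_examples // batch_size calls to next_batch()."""
        seen = []
        for _ in range(data_set.num_examples // batch_size):
            seen.extend(data_set.next_batch(batch_size))
        return set(seen)

    data = ToyDataSet(num_examples=55000)
    for _ in range(2000):        # simulate the training loop's batches, which
        data.next_batch(100)     # leave _index_in_epoch somewhere mid-epoch

    first = evaluate(data)       # "Something strange:" evaluation
    second = evaluate(data)      # "Training Data Eval:" evaluation

    # Each pass straddles an epoch boundary, so after the reshuffle some
    # examples are fed twice and others not at all, and the two passes miss
    # *different* examples -- which is why the measured precision differs.
    print(len(first), len(second), len(first ^ second))

    Under this toy model, each pass typically touches fewer than the full 55000 distinct examples, and the two passes touch different subsets. If a deterministic evaluation is needed despite the shared DataSet, one option is to feed fixed slices of data_sets.train.images/labels directly instead of going through next_batch(), so that the evaluation never advances the shared epoch state.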