
What units are used to define CNTK's epoch size?


If I understand correctly, in the CNTK Python API, Trainer.previous_minibatch_sample_count is supposed to return the number of samples (and NOT sequences) in the previous minibatch. I can see that it works as expected in the LanguageUnderstanding example (i.e. the number of samples in the last minibatch is indeed close to the minibatch_size that is used):

minibatch_size = 70
...
Minibatch[   1-   1]: loss = 4.857261 * 67, metric = 100.0% * 67
Minibatch[   2-   2]: loss = 4.835442 * 63, metric = 60.3% * 63
Minibatch[   3-   3]: loss = 4.798552 * 68, metric = 36.8% * 68
Minibatch[   4-   4]: loss = 4.751775 * 70, metric = 35.7% * 70
Minibatch[   5-   5]: loss = 4.678326 * 65, metric = 30.8% * 65

Yet, if I modify the (separate) SequenceClassification example to use ProgressPrinter (the only change), I get the following output:

minibatch_size = 200
...
Minibatch[   1-   1]: loss = 1.611397 * 44, metric = 88.6% * 44
Minibatch[   2-   2]: loss = 1.611021 * 47, metric = 91.5% * 47
Minibatch[   3-   3]: loss = 1.608516 * 42, metric = 88.1% * 42
Minibatch[   4-   4]: loss = 1.611613 * 44, metric = 93.2% * 44
Minibatch[   5-   5]: loss = 1.610344 * 47, metric = 93.6% * 47

In the output above, the ‘number of samples’ reported by the trainer (40-50) is considerably smaller than minibatch_size (200). I have manually confirmed that, in this case, the Trainer appears to be returning the number of SEQUENCES in the minibatch rather than the number of samples.

Is this something expected? If so, what is the logic here?

I can see that some tutorials/examples rely on the value returned by Trainer.previous_minibatch_sample_count to determine the end of an epoch (a rough sketch of that pattern follows). Will this always work reliably?
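
For context, the epoch-tracking pattern I am referring to looks roughly like this (a minimal sketch; reader, input_map, and trainer are assumed to be set up as in the SequenceClassification example, and epoch_size is a placeholder value, not one from the example):

minibatch_size = 200
epoch_size = 18000   # placeholder: intended number of samples per epoch

samples_seen = 0
while samples_seen < epoch_size:
    data = reader.next_minibatch(minibatch_size, input_map=input_map)
    trainer.train_minibatch(data)
    # tutorials typically advance the epoch counter with this value:
    samples_seen += trainer.previous_minibatch_sample_count

If the count returned here is sequences rather than samples, this loop will run through many more minibatches than the epoch size (in samples) suggests.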


Solution

  • Collating answers from several folks on the team:

    • The count returned by the trainer is the number of labels, which in this case equals the number of sequences. The minibatch_size you specify is in terms of samples (across all streams), and the minibatch source returns a batch such that no stream exceeds the specified sample count. Here the feature stream contains multiple words (samples) per sequence and therefore determines the bounding threshold.

    • The trainer returns the number of samples that give rise to the gradient, i.e. the number of labels. It can also be thought of as the number of items summed up in the objective function. The sketch after this list illustrates both counts for a single minibatch.
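
A minimal sketch of how to see both counts for one minibatch (assuming the SequenceClassification setup; the names features and label are assumed to be the input variables used as keys in input_map):

data = reader.next_minibatch(minibatch_size, input_map=input_map)

# per-stream counts on the returned MinibatchData objects:
num_words     = data[features].num_samples   # word tokens; this is what minibatch_size bounds
num_sequences = data[label].num_samples      # one label per sequence, so this equals the sequence count

trainer.train_minibatch(data)
# for this model, previous_minibatch_sample_count reports the label count
# (the items summed in the objective), i.e. num_sequences above
print(num_words, num_sequences, trainer.previous_minibatch_sample_count)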