Tags: recurrent-neural-network, mxnet

mxnet model does not produce same output for same input with no intermediate gradient backprop


I have some experience with Tensorflow but only about a week with mxnet. I am trying to understand the behavior of some code when I hit a breakpoint in the function below:

def train_and_eval(lr, end_date_str, pred):
    model.collect_params().initialize(mx.init.Xavier(), ctx=ctx, force_reinit=True)
    mgr      = ProcessMgr(2, end_date_str)

    for epoch in range(args_epochs):    
        for i in range(2):
            if i == TRAIN_MODE:
                mgr.switch_to_train()
            elif epoch == args_epochs - 1 and i == VALIDATE_MODE:
                mgr.switch_to_validate()
            else:
                break

            while True:
                try:
                    data, target, eval_target, date_str = mgr.get_batch()
                    data        = gluon.utils.split_and_load(data, ctx)
                    target      = gluon.utils.split_and_load(target, ctx)
                    eval_target = gluon.utils.split_and_load(eval_target, ctx)
                    data        = [mx.nd.swapaxes(d, 0, 1) for d in data]

                    with autograd.record():                    
                        losses = [loss(model(X)[-args_batch_size:], Y) for X, Y in zip(data, target)]
                        null_loss_vals = sum([Y.square().sum().asscalar() for Y in target])
                        model_loss_vals = sum([sum(l).asscalar() for l in losses])
                        null_loss[i] += null_loss_vals
                        model_loss[i] += model_loss_vals

                        pdb.set_trace()  # BREAKPOINT IS HERE
                        if i == TRAIN_MODE:
                            for l in losses:
                                l.backward()
                            x = 18
                            grads = [i.grad(ctx) for i in model.collect_params().values() if i._grad is not None]
                            gluon.utils.clip_global_norm(grads, args_clip)
                            trainer.step(GPU_COUNT * args_batch_size)
                except:
                    print("completed an epoch")
                    break

I am getting some unexpected values for the losses I am calculating, so I put a breakpoint in to see what was going on. The problem is that when I run the same data through the model, I get different outputs each time. Below are some of the outputs I see when I hit the pdb breakpoint and run the same data through the model.

<NDArray 38400x1 @gpu(0)>
(Pdb) model(data[0])

[[ 2.9265028e-01]
 [ 9.3701184e-03]
 [ 4.3234527e-02]
 ...
 [-5.0668776e-09]
 [-2.7628975e-08]
 [-1.9340845e-08]]
<NDArray 38400x1 @gpu(0)>
(Pdb) model(data[0])

[[ 1.5275864e-01]
 [ 2.0615126e-01]
 [ 4.6957955e-02]
 ...
 [-2.6077061e-08]
 [-9.2040580e-09]
 [-3.2883932e-08]]
<NDArray 38400x1 @gpu(0)>
(Pdb) data[0]

[[[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 [[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 [[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 ...

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]
<NDArray 128x300x2 @gpu(0)>
(Pdb) data[0]

[[[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 [[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 [[ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]
  ...
  [ 0. -4.]
  [ 0. -4.]
  [ 0. -4.]]

 ...

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]
  ...
  [ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]
<NDArray 128x300x2 @gpu(0)>
(Pdb) 

I am perplexed as to what is going on here. I realize my code may not be entirely proper, in that I am not running anything in a predict or inference mode (I was planning to tackle that later), but I don't understand how the model itself seems to be changing each time I run input through it, even though I am not calling backward() or trainer.step(). Any insight would be appreciated. Why is this happening?
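
One check I could run at the pdb prompt is to compare repeated passes with inference behavior forced on. This is only a sketch, assuming mxnet's autograd.predict_mode() scope; model and data are the objects from the session above:

from mxnet import autograd

out_a = model(data[0])          # inside autograd.record(), this runs in training mode
out_b = model(data[0])

with autograd.predict_mode():   # temporarily force inference behavior
    out_c = model(data[0])
    out_d = model(data[0])

print((out_a - out_b).abs().max().asscalar())   # compare repeated training-mode passes
print((out_c - out_d).abs().max().asscalar())   # compare repeated inference-mode passes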

My only guess is that the hidden state is somehow preserved between runs. But I don't think I coded it to do so (I saw an example where this was done, and the hidden state had to be explicitly saved and fed back into the RNN). In particular, I have not implemented a begin_state method for my gluon.Block. I am not sure how to verify or disprove this guess.

Here is my gluon.Block implementation, in case it is relevant:

class RNNModel(gluon.Block):
    def __init__(self, mode, num_inputs, num_embed, num_hidden,
                 num_layers, dropout=0.5, tie_weights=False, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.drop = nn.Dropout(dropout)
            self.rnn = rnn.GRU(num_hidden, num_layers, dropout=dropout,
                               input_size=num_inputs)
            self.decoder = nn.Dense(1, in_units = num_hidden)
            self.num_hidden = num_hidden

    def forward(self, inputs):
        output = self.rnn(inputs)
        output = self.drop(output)
        decoded = self.decoder(output.reshape((-1, self.num_hidden)))
        return decoded
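
One way I could think of to probe the hidden-state guess (again just a sketch, run in the pdb session above and assuming the GRU's default TNC layout, where the batch dimension is axis 1) is to call the GRU sub-block directly with a freshly zeroed state supplied explicitly, so that nothing can be carried over between calls:

state = model.rnn.begin_state(batch_size=data[0].shape[1],
                              func=mx.nd.zeros, ctx=data[0].context)
out1, _ = model.rnn(data[0], state)   # same explicit zero state both times
out2, _ = model.rnn(data[0], state)
print((out1 - out2).abs().max().asscalar())

If these outputs still differ even with an identical, explicitly supplied state, then carried-over hidden state alone cannot explain the behavior; if they match, the guess looks more plausible.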

Solution

  • I determined that within the with autograd.record() context the hidden state must keep evolving, because I did not see this behavior outside of that context. Since my model does not expose the hidden state, I was not able to verify this explicitly, but it is the explanation that makes the most sense. I was also able to confirm that the weights that are exposed (via trainer._params) were not changing, so it had to be the hidden state. (A sketch of how the state could be exposed for inspection follows below.)
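
If I wanted to check this directly, one option (a sketch only, not something I have run against the original setup) would be a variant of the Block that creates, passes, and returns the GRU state explicitly so it can be inspected between calls:

import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn, rnn

class RNNModelWithState(gluon.Block):
    # Same layers as RNNModel above, but the GRU state is handled explicitly
    def __init__(self, num_inputs, num_hidden, num_layers, dropout=0.5, **kwargs):
        super(RNNModelWithState, self).__init__(**kwargs)
        with self.name_scope():
            self.drop = nn.Dropout(dropout)
            self.rnn = rnn.GRU(num_hidden, num_layers, dropout=dropout,
                               input_size=num_inputs)
            self.decoder = nn.Dense(1, in_units=num_hidden)
            self.num_hidden = num_hidden

    def forward(self, inputs, state=None):
        if state is None:
            # Fresh zero state on every call, so nothing is carried over implicitly
            state = self.rnn.begin_state(batch_size=inputs.shape[1],
                                         func=mx.nd.zeros, ctx=inputs.context)
        output, state = self.rnn(inputs, state)
        output = self.drop(output)
        decoded = self.decoder(output.reshape((-1, self.num_hidden)))
        return decoded, state

Comparing the returned state across two passes over the same batch would then show directly whether it is changing inside the autograd.record() scope.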