Tags: tensorflow, batching, recurrent-neural-network, sequence-to-sequence

TensorFlow continuous text sequence-to-sequence. Why batch?


I'm working through building a sequence-to-sequence Shakespeare predictor, and looking at sample code it seems to do batching in groups of 50 characters. I'm a little confused by this. If the text is continuous and you are processing in 50-character chunks, then surely that means the loss is only ever calculated from the next expected character after the 50th character, and the model is never trained on the next expected characters for the other 49 characters. In other words, if you have 1000 characters split into 20 sets of 50, the model is only ever taught to predict 20 different characters. Shouldn't these batches shift by a random offset each epoch so it learns to predict the other characters?
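
To make the arithmetic concrete, here is a toy sketch of the non-overlapping chunking I'm describing (my own illustration, not the sample code; the names and sizes are made up):

    # Pretend corpus of 1000 characters, chunked the way I understand it
    text = "x" * 1000
    seq_length = 50
    chunks = [text[i:i + seq_length]
              for i in range(0, len(text), seq_length)]
    print(len(chunks))  # 20 non-overlapping chunks
    # If loss were only computed on the character following each chunk,
    # that would be just 20 training targets over the whole corpus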

This can't be right, surely? What am I missing here in my understanding?

Also, are the batches always processed sequentially? Since the state is carried forward to represent the previous sequences, surely the order matters.

Thanks, Ray

Update 7/24: Here is the original code...

    self.num_batches = int(self.tensor.size / (self.batch_size *
                                               self.seq_length))

    # When the data (tensor) is too small,
    # let's give them a better error message
    if self.num_batches == 0:
        assert False, "Not enough data. Make seq_length and batch_size small."

    # Truncate so the data divides evenly into batches
    self.tensor = self.tensor[:self.num_batches * self.batch_size * self.seq_length]
    xdata = self.tensor
    ydata = np.copy(self.tensor)
    # Targets are the inputs shifted left by one character;
    # the final target wraps around to the first character
    ydata[:-1] = xdata[1:]
    ydata[-1] = xdata[0]
    # Reshape to (batch_size, num_batches * seq_length), then split along
    # axis 1 into num_batches arrays of shape (batch_size, seq_length)
    self.x_batches = np.split(xdata.reshape(self.batch_size, -1),
                              self.num_batches, 1)
    self.y_batches = np.split(ydata.reshape(self.batch_size, -1),
                              self.num_batches, 1)

As far as I can see it doesn't seem to be overlapping, but I'm new to Python, so I may be missing something.
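
To check my reading, here is a toy run of the same reshape-and-split logic on a 12-element tensor (my own sketch outside the class, with made-up sizes):

    import numpy as np

    tensor = np.arange(12)      # stand-in for 12 encoded characters
    batch_size, seq_length = 2, 3
    num_batches = tensor.size // (batch_size * seq_length)  # 2

    xdata = tensor
    ydata = np.copy(tensor)
    ydata[:-1] = xdata[1:]      # targets shifted left by one
    ydata[-1] = xdata[0]        # last target wraps to the first char

    x_batches = np.split(xdata.reshape(batch_size, -1), num_batches, 1)
    y_batches = np.split(ydata.reshape(batch_size, -1), num_batches, 1)

    print(x_batches[0])  # [[0 1 2]
                         #  [6 7 8]]
    print(y_batches[0])  # [[1 2 3]
                         #  [7 8 9]]
    # The windows in x don't overlap; y is simply x shifted by one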


Solution

  • If you have 1000 chars and create 20 sets of 50 chars, those are non-overlapping windows, and as you said that won't work. Instead, use overlapping windows shifted by one char, which creates (1000 - 50) sets of training data; a sketch follows below. This is the right way to do it.
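
A minimal sketch of that overlapping, shift-by-one windowing (my own illustration; the function name and sizes are made up):

    import numpy as np

    def overlapping_windows(tensor, seq_length):
        """Make every position the start of a training sequence."""
        n = tensor.size - seq_length             # e.g. 1000 - 50 = 950 windows
        x = np.stack([tensor[i:i + seq_length] for i in range(n)])
        y = np.stack([tensor[i + 1:i + 1 + seq_length] for i in range(n)])
        return x, y

    x, y = overlapping_windows(np.arange(1000), 50)
    print(x.shape, y.shape)  # (950, 50) (950, 50)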