Why the output and the hidden state of the last layer returned by GRU are not the same


Okay, here is what the PyTorch documentation says. A GRU returns two variables:

  • output has a shape of (N, L, D * H_out), containing the output features (h_t) from the last layer of the GRU, for each t.
  • hidden has a shape of (D * num_layers, N, H_out), containing the final hidden state for the input sequence.

Here D = 2 for a bidirectional GRU (1 otherwise) and H_out = hidden_size.

So as I understand it, for a one-layer, unidirectional GRU, the last time step of output should equal the hidden state returned by the GRU. I have verified this in code, and it is correct. But when I use a bidirectional GRU, the hidden state and the last time step only match in the forward direction; the backward direction is different. Does the GRU apply an additional layer to the backward pass?
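For reference, here is a minimal version of that unidirectional check (the sizes are arbitrary and the variable names are my own):

import torch
from torch import nn

# One-layer, unidirectional GRU: the last time step of `output`
# equals the final hidden state.
x = torch.randn((1, 4, 3))
gru_uni = nn.GRU(input_size=3, hidden_size=5, num_layers=1, batch_first=True)
o_uni, h_uni = gru_uni(x)
print(torch.equal(o_uni[:, -1, :], h_uni[0]))  # True

And the bidirectional case, where the mismatch appears: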

input_ids = torch.randn((1, 4, 3))  # (batch, seq_len, input_size)
gru = nn.GRU(input_size=3, hidden_size=5, bidirectional=True, num_layers=1, batch_first=True)
o, h = gru(input_ids)

print(o[:, -1, :])  # last time step of output: forward and backward features concatenated
print(torch.cat([h[0::2, :, :], h[1::2, :, :]], dim=-1))  # concatenate forward and backward final hidden states

tensor([[-0.2308,  0.0597, -0.4346, -0.3713,  0.2811,  0.2582, -0.0627,  0.3114,
         -0.3813,  0.0590]], grad_fn=<SliceBackward0>)

tensor([[[-0.2308,  0.0597, -0.4346, -0.3713,  0.2811,  0.3513, -0.0135,
           0.1894, -0.4211,  0.0058]]], grad_fn=<CatBackward0>)
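Comparing the two tensors element-wise, the first 5 values (hidden_size, the forward direction) match, while the last 5 (the backward direction) do not. A quick check, reusing o and h from the snippet above:

print(torch.allclose(o[:, -1, :5], h[0]))  # forward halves match: True
print(torch.allclose(o[:, -1, 5:], h[1]))  # backward halves differ: False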

Edit:

I found this post: https://discuss.pytorch.org/t/missing-or-conflicting-documentations-between-versions/166075/2


Solution

  • Since the GRU runs in both directions, the first chunk of the hidden state holds the forward direction and the second chunk the backward direction.

    As a result, the forward chunk matches the last time step of output, while the backward chunk matches the first time step: the backward GRU reads the sequence from the end toward the start, so its final hidden state is produced at t = 0.

    import torch
    from torch import nn

    bs = 1            # batch size
    sl = 4            # sequence length
    input_size = 3
    hidden_size = 5
    bidir = True
    n_layers = 1
    
    input_ids = torch.randn((bs, sl, input_size))
    gru = nn.GRU(
                input_size=input_size, 
                hidden_size=hidden_size, 
                bidirectional=bidir, 
                num_layers=n_layers, 
                batch_first=True
            )
    
    o, h = gru(input_ids)
    
    (o[:, -1][:, :hidden_size] == h[0]).all()  # forward direction: last time step
    > tensor(True)
    
    (o[:, 0][:, hidden_size:] == h[1]).all()   # backward direction: first time step
    > tensor(True)
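    The two directions can also be separated explicitly by reshaping output: with batch_first=True the features split into (batch, seq, num_directions, hidden_size), with forward as direction 0 and backward as direction 1, per the PyTorch docs. A minimal sketch reusing o and h from above:

    o_dirs = o.reshape(bs, sl, 2, hidden_size)  # (batch, seq, direction, hidden)
    
    (o_dirs[:, -1, 0] == h[0]).all()  # forward direction, last time step
    > tensor(True)
    
    (o_dirs[:, 0, 1] == h[1]).all()   # backward direction, first time step
    > tensor(True)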