I recently started exploring LSTMs in PyTorch, and I don't quite understand the difference between using `hidden_size` and `proj_size` when trying to define the output size of my LSTM. For context, I have an input size of 5, a sequence length of 30, and I want an output size of 2, i.e. two outputs of sequence length 30 each. Should I just set `hidden_size` to 2, or would it be better to set `proj_size=2` so that I can still tune the `hidden_size` hyper-parameter?
From the documentation, a unidirectional LSTM gives three outputs:

- `output`, the main output, of size `(L, N, H_out)`
- `h_n`, the hidden state, of size `(num_layers, N, H_out)`
- `c_n`, the cell state, of size `(num_layers, N, H_cell)`
The `proj_size` argument changes the model's `H_out`: if `proj_size > 0`, then `H_out = proj_size`, otherwise `H_out = hidden_size`. With `proj_size > 0`, each layer's hidden state is multiplied by a learnable projection matrix (`h_t = W_hr h_t`) before being used, so `proj_size` also changes the shape of the hidden-to-hidden weights (`weight_hh_l[k]`, i.e. the stacked `W_hi|W_hf|W_hg|W_ho`) and adds a projection weight `weight_hr_l[k]` per layer.
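For instance, with the sizes from your question (a quick illustrative check; `hidden_size=64` is an arbitrary value you would tune):

import torch
import torch.nn as nn

# hidden_size stays a free hyper-parameter; proj_size fixes the output size
lstm = nn.LSTM(input_size=5, hidden_size=64, proj_size=2, batch_first=True)
x = torch.randn(1, 30, 5)    # (N, L, input_size)
out, (h_n, c_n) = lstm(x)
out.shape
> torch.Size([1, 30, 2])     # H_out = proj_size
c_n.shape
> torch.Size([1, 1, 64])     # H_cell = hidden_size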
This feature is motivated by the paper linked in the documentation, which found that pairing a large cell with a smaller projected hidden state gave better parameter-adjusted performance than simply shrinking the cell.
To show the difference:
# without projection
import torch
import torch.nn as nn

lstm_kwargs = {
    'input_size': 64,
    'hidden_size': 512,
    'num_layers': 3,
    'batch_first': True
}
lstm1 = nn.LSTM(**lstm_kwargs)
[(k, v.shape) for k, v in lstm1.state_dict().items()]
> [('weight_ih_l0', torch.Size([2048, 64])),
('weight_hh_l0', torch.Size([2048, 512])),
('bias_ih_l0', torch.Size([2048])),
('bias_hh_l0', torch.Size([2048])),
('weight_ih_l1', torch.Size([2048, 512])),
('weight_hh_l1', torch.Size([2048, 512])),
('bias_ih_l1', torch.Size([2048])),
('bias_hh_l1', torch.Size([2048])),
('weight_ih_l2', torch.Size([2048, 512])),
('weight_hh_l2', torch.Size([2048, 512])),
('bias_ih_l2', torch.Size([2048])),
('bias_hh_l2', torch.Size([2048]))]
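(The leading dimension 2048 here is `4 * hidden_size`: the weights for the four gates, `W_ii|W_if|W_ig|W_io`, are stacked into a single matrix.)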
x = torch.randn(8, 12, 64)   # (N, L, input_size)
x1, (h1, c1) = lstm1(x)
x1.shape
> torch.Size([8, 12, 512])
h1.shape
> torch.Size([3, 8, 512])
c1.shape
> torch.Size([3, 8, 512])
# with projection (same base hyper-parameters as above)
lstm_kwargs = {
    'input_size': 64,
    'hidden_size': 512,
    'num_layers': 3,
    'batch_first': True
}
lstm2 = nn.LSTM(proj_size=256, **lstm_kwargs)
[(k, v.shape) for k, v in lstm2.state_dict().items()]
[(k, v.shape) for k,v in lstm2.state_dict().items()]
> [('weight_ih_l0', torch.Size([2048, 64])),
('weight_hh_l0', torch.Size([2048, 256])),
('bias_ih_l0', torch.Size([2048])),
('bias_hh_l0', torch.Size([2048])),
('weight_hr_l0', torch.Size([256, 512])),
('weight_ih_l1', torch.Size([2048, 256])),
('weight_hh_l1', torch.Size([2048, 256])),
('bias_ih_l1', torch.Size([2048])),
('bias_hh_l1', torch.Size([2048])),
('weight_hr_l1', torch.Size([256, 512])),
('weight_ih_l2', torch.Size([2048, 256])),
('weight_hh_l2', torch.Size([2048, 256])),
('bias_ih_l2', torch.Size([2048])),
('bias_hh_l2', torch.Size([2048])),
('weight_hr_l2', torch.Size([256, 512]))]
x = torch.randn(8, 12, 64)
x2, (h2, c2) = lstm2(x)
x2.shape
> torch.Size([8, 12, 256])
h2.shape
> torch.Size([3, 8, 256])
c2.shape
> torch.Size([3, 8, 512])
Note that with projection we get an additional weight matrix per layer (`weight_hr_l[k]`), and the output and hidden-state sizes shrink to `proj_size`; the cell size does not change.
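This is also where the parameter savings come from; a quick check, reusing `lstm1` and `lstm2` from above:

sum(p.numel() for p in lstm1.parameters())
> 5386240
sum(p.numel() for p in lstm2.parameters())
> 3158016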
As for your question about the output size: typically you would put another layer on top of the LSTM (e.g. `nn.Linear`) to map the hidden representation to your exact output size, rather than forcing `hidden_size` down to 2, as in the sketch below.
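A minimal sketch of that approach with your shapes (`hidden_size=64` is again an arbitrary value to tune):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=5, hidden_size=64, batch_first=True)
head = nn.Linear(64, 2)      # maps H_out -> 2 at every time step

x = torch.randn(8, 30, 5)    # (N, L, input_size)
out, _ = lstm(x)             # (N, 30, 64)
y = head(out)                # (N, 30, 2): two outputs per time step
y.shape
> torch.Size([8, 30, 2])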