Tags: artificial-intelligence, lstm

In an LSTM, how is [h(t-1)] the same size as [h(t)]?


I can't seem to find the answer to this specific question anywhere. I'm recreating an LSTM from scratch, because I want to understand it better.

I've drawn out my current understanding of an LSTM, and attached it to this post.

If it takes h(t-1) and concatenates it with x(t), that makes a vector larger than h(t-1). A sigmoid is later applied to this concatenated vector, tanh is applied to the cell state, and then the two are multiplied together. This produces the new hidden state.

So how is h(t) not larger in size than h(t-1)? Why does the hidden state not grow with each timestep?

Illustration


Solution

  • Hm, there are some projection steps hidden inside the operations in the diagram. The "sigmoid" symbol in the diagram really means applying a sigmoid function to the output of a linear projection. That is, using @ for matrix multiplication, numpy style, you're not simply taking sigmoid([h(t-1); x(t)]); you're actually taking sigmoid(W @ x(t) + U @ h(t-1)) (leaving out the bias term for now), where W and U are projection matrices with learned parameters.

    In matrix land, this is indeed mathematically equivalent to concatenating hx(t) = [h(t-1); x(t)] and learning some parameter V of appropriate size such that V @ hx(t) is the input to your sigmoid. In fact, V is just the horizontal concatenation of U, W (in that order) from above.

    Now, let's work through the example in your diagram. You have h(t-1) with shape (3,) and x(t) with shape (2,), so we'd learn W with shape (3, 2) and U with shape (3, 3) to yield a final output of shape (3,), which is the same as h(t-1). Note that if we'd decided to represent this as the concatenated vector hx(t) with shape (5,), we could indeed just horizontally merge U, W to get something with shape (3, 5) -- which still yields the final output of the desired shape (3,). (See the numpy sketch at the end of this answer for these shapes worked out in code.)

    To reach h(t), you need to do one more element-wise multiplication with a cell-state term (at the node marked x in your diagram), but that turns out to have shape (3,) as well.

    The Wikipedia page on LSTMs gives a complete overview of the operations and their dimensions as well, in a more compact form than the equations in Section 2 of Gers, Schmidhuber, and Cummins.
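
    For concreteness, here is a minimal numpy sketch of the shapes discussed above. The variable names and random parameter values are just illustrative stand-ins for learned weights, not any particular library's API:

        import numpy as np

        rng = np.random.default_rng(0)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        # Shapes from the worked example: hidden size 3, input size 2.
        h_prev = rng.normal(size=3)      # h(t-1), shape (3,)
        x_t    = rng.normal(size=2)      # x(t),   shape (2,)
        c_t    = rng.normal(size=3)      # cell state, shape (3,)

        # Learned projection matrices (random stand-ins here) and bias.
        W = rng.normal(size=(3, 2))      # projects x(t) into the hidden space
        U = rng.normal(size=(3, 3))      # projects h(t-1) into the hidden space
        b = np.zeros(3)

        # Gate: a sigmoid over a linear projection, not over the raw concatenation.
        gate = sigmoid(W @ x_t + U @ h_prev + b)         # shape (3,)

        # Equivalent "concatenated" form: V = [U, W], hx(t) = [h(t-1); x(t)].
        V  = np.hstack([U, W])                           # shape (3, 5)
        hx = np.concatenate([h_prev, x_t])               # shape (5,)
        assert np.allclose(gate, sigmoid(V @ hx + b))

        # One more element-wise multiplication with tanh of the cell state
        # gives the new hidden state, which keeps the shape of h(t-1).
        h_t = gate * np.tanh(c_t)                        # shape (3,)
        print(h_t.shape)                                 # (3,)

    Because every gate is a projection back down to the hidden size, h(t) stays the same size as h(t-1) no matter how many timesteps you run.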