Tags: deep-learning, pytorch, conv-neural-network, lstm, recurrent-neural-network

Reducing the dimensions of a 4D feature tensor from ResNet to fit into a 2D LSTM model


I am designing a machine learning model that takes a feature tensor from ResNet and uses an LSTM to identify the sequence of letters in an image. The feature tensor from ResNet is 4-D, but nn.LSTMCell expects 2-D inputs. I know about methods such as .view() and .squeeze() that can reduce the number of dimensions, but they seem to change the sizes of the remaining dimensions. The tensor starts as [128, 2, 5, 512] and needs to become [128, 512]; however, calling .view(-1, 512) multiplies the other dimensions together and gives [1280, 512]. How would you change the number of dimensions without multiplying them together like this?
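For reference, a minimal snippet reproducing the behaviour described above, using a random tensor as a stand-in for the ResNet features:

```python
import torch

# Dummy stand-in for the 4-D ResNet feature tensor from the question
features = torch.randn(128, 2, 5, 512)

flat = features.view(-1, 512)
print(flat.shape)  # torch.Size([1280, 512]): 128 * 2 * 5 = 1280 rows
```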


Solution

  • The output of the CNN should be a 3-D tensor (e.g. [128, x, 512]) so that it can be treated as a sequence. You can then feed it into nn.LSTMCell() with a for-loop of x iterations, as in the first sketch below.

    However, a 4-D tensor still retains spatial structure and is not appropriate to feed into an LSTM directly. A typical practice is to redesign your CNN architecture so that it produces a 3-D tensor. For example, you can add an nn.Conv2d() (or something else, such as a pooling layer) at the end of the network to reshape the outputs to [128, x, 512], as in the second sketch below.
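A minimal sketch of the for-loop approach, assuming the [128, 2, 5, 512] layout from the question and simply merging the two spatial dimensions (2 × 5 = 10) into the sequence axis; the hidden size of 256 is an arbitrary choice:

```python
import torch
import torch.nn as nn

batch, height, width, channels = 128, 2, 5, 512   # shapes from the question
features = torch.randn(batch, height, width, channels)  # 4-D CNN output

# Merge the spatial dims into a sequence dimension instead of into the batch:
# [128, 2, 5, 512] -> [128, 10, 512]
seq = features.view(batch, height * width, channels)

hidden_size = 256  # hypothetical LSTM width
cell = nn.LSTMCell(input_size=channels, hidden_size=hidden_size)

h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
outputs = []
for t in range(seq.size(1)):            # x-iteration for-loop over the sequence
    h, c = cell(seq[:, t, :], (h, c))   # each step consumes a [128, 512] slice
    outputs.append(h)
```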
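And a sketch of the second option, collapsing the spatial structure inside the network itself. This assumes the more common channels-first layout [128, 512, 2, 5] and uses adaptive average pooling rather than an extra nn.Conv2d(), so that each column of the feature map becomes one time step:

```python
import torch
import torch.nn as nn

# Hypothetical channels-first ResNet feature map: (batch, C, H, W)
features = torch.randn(128, 512, 2, 5)

# Collapse the height dimension so each column becomes one time step,
# a common choice in CRNN-style text recognizers.
pool = nn.AdaptiveAvgPool2d((1, None))  # [128, 512, 2, 5] -> [128, 512, 1, 5]
seq = pool(features).squeeze(2)         # -> [128, 512, 5]
seq = seq.permute(0, 2, 1)              # -> [128, 5, 512]: (batch, x=5, 512)
```

The resulting [128, x, 512] tensor can then be fed into the nn.LSTMCell() loop shown above.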