Tags: tensorflow, keras, deep-learning, lstm, conv-neural-network

CNN and LSTM for image captioning in Keras


I want to implement the following architecture in Keras for image captioning, but I am having a lot of difficulty connecting the output of the CNN to the input of the LSTM.

[Image: the structure of the DL net]

It is important to use the output of the CNN as the LSTM's input, something like the following image. [Image: the structure, 2]

I can build an LSTM or a CNN separately, but I don't know how to build this combined structure. The image must be transformed by the CNN into a feature description and fed into the LSTM, while the words of the caption, in vector representation, are fed into the LSTM cells from the other side. That way, cell number one is responsible for producing the first word, and so on. I think the CNN and the LSTM must be trained at the same time.

By the way, it is not school homework :)

Thanks in advance for your help.


Solution

  • I am assuming that you are familiar with the TensorFlow Keras API. I would implement the code in the following way.

    Assumptions: vocab_size = 4000 and input image size = (572, 572, 3).

    import tensorflow as tf
    from tensorflow.keras import layers
    
    vocab_size = 4000
    
    inputs = layers.Input(shape=(572, 572, 3))
    
    # Contracting convolutional stack (VGG-style feature extractor)
    c0 = layers.Conv2D(64, activation='relu', kernel_size=3)(inputs)
    c1 = layers.Conv2D(64, activation='relu', kernel_size=3)(c0)
    c2 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c1)
    
    c3 = layers.Conv2D(128, activation='relu', kernel_size=3)(c2)
    c4 = layers.Conv2D(128, activation='relu', kernel_size=3)(c3)
    c5 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c4)
    
    c6 = layers.Conv2D(256, activation='relu', kernel_size=3)(c5)
    c7 = layers.Conv2D(256, activation='relu', kernel_size=3)(c6)
    c8 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c7)
    
    c9 = layers.Conv2D(512, activation='relu', kernel_size=3)(c8)
    c10 = layers.Conv2D(512, activation='relu', kernel_size=3)(c9)
    c11 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c10)  # shape: (32, 32, 512)
    
    # Dense acts on the last axis only, so fc2 has shape (32, 32, 4096)
    fc1 = layers.Dense(4096)(c11)
    fc2 = layers.Dense(4096)(fc1)
    
    # Flatten the 32x32 spatial grid into a sequence of 32*32 = 1024
    # timesteps of 4096 features each, so the LSTM receives a 3-D input
    reshape = layers.Reshape((32 * 32, 4096))(fc2)
    
    rnn1 = layers.LSTM(64, return_sequences=True)(reshape)
    rnn2 = layers.LSTM(64)(rnn1)
    
    outputs = layers.Dense(vocab_size, activation='softmax')(rnn2)
    
    model = tf.keras.Model(inputs=inputs, outputs=outputs, name="caption_generate")
    
    model.summary()
    

    The important part here is reshaping the CNN output from four dimensions (batch, height, width, channels) to the three dimensions (batch, timesteps, features) that an LSTM expects:

    reshape = layers.Reshape((32 * 32, 4096))(fc2)
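To see where the sequence length for the reshape comes from, you can trace one spatial dimension through the conv/pool stack by hand. A small helper using standard valid-convolution arithmetic (the function name is mine, not from any library):

```python
def trace_spatial_size(size, blocks=4):
    """Trace one spatial dimension through `blocks` repetitions of
    [Conv2D(k=3, 'valid'), Conv2D(k=3, 'valid'), MaxPool2D(2x2, stride 2)]."""
    for _ in range(blocks):
        size -= 2       # first 3x3 'valid' conv trims a 1-pixel border
        size -= 2       # second 3x3 'valid' conv
        size //= 2      # 2x2 max-pool with stride 2 halves the size
    return size

side = trace_spatial_size(572)
print(side, side * side)  # 32 spatial size -> 1024 timesteps
```

So a 572×572 input ends up as a 32×32 grid of feature vectors, which is why the reshape yields 1024 timesteps.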
    

    The code above builds cleanly, and you should be able to use it as a starting point. I hope the answer serves you well.
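For the architecture described in the question, where caption words are fed into the decoder alongside the image features, a common pattern is the "merge" captioning model: encode the image once, encode the partial caption with an embedding plus an LSTM, combine the two, and predict the next word. A minimal sketch with toy sizes; `vocab_size`, `max_len`, `feat_dim`, and all layer widths here are my own assumptions, and the image branch would take features from a CNN such as the one above:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 100   # toy vocabulary size (assumption)
max_len = 10       # maximum caption length (assumption)
feat_dim = 64      # size of the pre-extracted CNN feature vector (assumption)

# Image branch: project the CNN feature vector into the decoder's space
img_in = layers.Input(shape=(feat_dim,))
img_vec = layers.Dense(32, activation='relu')(img_in)

# Text branch: the caption so far, given as word indices
txt_in = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, 32, mask_zero=True)(txt_in)
txt_vec = layers.LSTM(32)(emb)

# Merge both branches and predict the next word of the caption
merged = layers.add([img_vec, txt_vec])
out = layers.Dense(vocab_size, activation='softmax')(merged)

caption_model = tf.keras.Model([img_in, txt_in], out)
caption_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```

At training time, each example pairs the image features and a caption prefix with the next word as the target, so the CNN and LSTM weights can be trained jointly; at inference, you feed the generated words back in one at a time until an end token is produced.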