I want to implement the following architecture in Keras for image captioning purposes, but I am having a lot of difficulty connecting the output of the CNN to the input of the LSTM.
It is important to use the output of the CNN as the LSTM's input, something like the following image.
I can build an LSTM or a CNN separately, but this combined structure is what I don't know how to build. The image must be transformed by the CNN into a feature description and fed into the LSTM, while the words of the caption, in their vector representation, are inserted into the LSTM cells from the other side. That way, cell number one is responsible for producing the first word, and so on. I think both the CNN and the LSTM must be trained at the same time.
By the way, it is not school homework :)
Thanks in advance for your help.
I am assuming that you are familiar with the TensorFlow Keras API. I will implement the code in the following way.
Assumptions: vocab_size = 4000 and input_image_size = (572, 572, 3).
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 4000

# Convolutional part: turns the image into a feature description.
inputs = layers.Input(shape=(572, 572, 3))
c0 = layers.Conv2D(64, activation='relu', kernel_size=3)(inputs)
c1 = layers.Conv2D(64, activation='relu', kernel_size=3)(c0)
c2 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c1)
c3 = layers.Conv2D(128, activation='relu', kernel_size=3)(c2)
c4 = layers.Conv2D(128, activation='relu', kernel_size=3)(c3)
c5 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c4)
c6 = layers.Conv2D(256, activation='relu', kernel_size=3)(c5)
c7 = layers.Conv2D(256, activation='relu', kernel_size=3)(c6)
c8 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c7)
c9 = layers.Conv2D(512, activation='relu', kernel_size=3)(c8)
c10 = layers.Conv2D(512, activation='relu', kernel_size=3)(c9)
c11 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c10)  # shape: (batch, 32, 32, 512)

# Fully connected layers, applied per spatial position.
fc1 = layers.Dense(4096)(c11)
fc2 = layers.Dense(4096)(fc1)  # shape: (batch, 32, 32, 4096)

# Collapse the spatial grid into a sequence of feature vectors:
# (batch, 32, 32, 4096) -> (batch, 1024, 4096).
reshape = layers.Reshape((-1, 4096))(fc2)

# Recurrent part: consumes the sequence and predicts a word over the vocabulary.
rnn1 = layers.LSTM(64, return_sequences=True)(reshape)
rnn2 = layers.LSTM(64)(rnn1)
outputs = layers.Dense(vocab_size, activation='softmax')(rnn2)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="caption_generate")
model.summary()
The important part here is to reshape the CNN output from 4 dimensions (batch, height, width, channels) into the 3 dimensions (batch, timesteps, features) that the LSTM expects as input:

reshape = layers.Reshape((-1, 4096))(fc2)
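You can check what this does with a standalone snippet; the shapes below match what fc2 produces in the model above:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 32, 32, 4096))    # stands in for fc2: (batch, height, width, channels)
y = layers.Reshape((-1, 4096))(x)  # -> (batch, timesteps, features)
print(y.shape)                     # (1, 1024, 4096)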
The code above works and you should be able to use it. I hope the answer serves you well.
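Note that the model above predicts a single word from the image alone. If you also want to feed the caption words into the LSTM alongside the image features, as you describe, one common pattern is to inject the image feature vector as the first step of the word sequence. Here is a minimal sketch of that idea; max_caption_len, embed_dim, and the tiny CNN backbone are placeholder choices of mine, not something prescribed by Keras:

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 4000
max_caption_len = 20   # placeholder value
embed_dim = 256        # placeholder value

# Image branch: any CNN backbone that ends in a single feature vector.
image_in = layers.Input(shape=(572, 572, 3))
x = layers.Conv2D(64, 3, activation='relu')(image_in)
x = layers.GlobalAveragePooling2D()(x)
img_feat = layers.Dense(embed_dim, activation='relu')(x)

# Caption branch: the previous words, embedded into vectors.
caption_in = layers.Input(shape=(max_caption_len,), dtype='int32')
emb = layers.Embedding(vocab_size, embed_dim)(caption_in)

# Inject the image feature as the first "word" of the sequence.
img_step = layers.Reshape((1, embed_dim))(img_feat)
seq = layers.Concatenate(axis=1)([img_step, emb])

# The LSTM sees the image first, then the caption words; each timestep
# predicts a distribution over the vocabulary for the next word.
rnn = layers.LSTM(256, return_sequences=True)(seq)
out = layers.TimeDistributed(layers.Dense(vocab_size, activation='softmax'))(rnn)

model = tf.keras.Model([image_in, caption_in], out)

Because both branches live in one graph, the CNN and the LSTM are trained at the same time, as you wanted; the training targets would be the caption shifted by one position, so cell number one is responsible for the first word, and so on.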