Tags: python, keras, deep-learning, keras-layer

How to add additional data to CNN+LSTM network


I have the following network (a pretrained CNN + LSTM to classify videos):

  frames, channels, rows, columns = 5, 3, 224, 224

  video = Input(shape=(frames, rows, columns, channels))

  cnn_base = VGG16(input_shape=(rows, columns, channels),
                   weights="imagenet",
                   include_top=True)  # <=== include_top=True
  cnn_base.trainable = False

  cnn = Model(cnn_base.input, cnn_base.layers[-3].output, name="VGG_fm")  # -3 is the 4096 layer
  encoded_frames = TimeDistributed(cnn, name="encoded_frames")(video)
  encoded_sequence = LSTM(256, name="encoded_seqeunce")(encoded_frames)
  hidden_layer = Dense(1024, activation="relu", name="hidden_layer")(encoded_sequence)
  outputs = Dense(10, activation="softmax")(hidden_layer)

  model = Model(video, outputs)

That looks like this:

[model architecture diagram]

Now I want to feed an additional 1D vector of 784 video features into the last layer. I tried replacing the last two lines with:

  encoding_input = keras.Input(shape=(784,), name="Encoding", dtype="float32")
  sentence_features = layers.Dense(units=60, name="sentence_features")(encoding_input)
  x = layers.concatenate([sentence_features, hidden_layer])
  outputs = Dense(10, activation="softmax")(x)

But got the error:

ValueError: Graph disconnected: cannot obtain value for tensor Tensor("Sentence-Input-Encoding_3:0", shape=(None, 784), dtype=float32) at layer "sentence_features". The following previous layers were accessed without issue: ['encoded_frames', 'encoded_seqeunce']

Any suggestions?


Solution

  • Your network now has two inputs, so don't forget to pass both when building the model:

    model = Model([video,encoding_input], outputs)
    

    Full example:

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import (Input, Dense, LSTM,
                                         TimeDistributed, concatenate)
    from tensorflow.keras.models import Model

    frames, channels, rows, columns = 5, 3, 224, 224

    video = Input(shape=(frames, rows, columns, channels))

    cnn_base = VGG16(input_shape=(rows, columns, channels),
                     weights="imagenet",
                     include_top=True)
    cnn_base.trainable = False

    # use the 4096-unit fully connected layer as the per-frame feature extractor
    cnn = Model(cnn_base.input, cnn_base.layers[-3].output, name="VGG_fm")
    encoded_frames = TimeDistributed(cnn, name="encoded_frames")(video)
    encoded_sequence = LSTM(256, name="encoded_seqeunce")(encoded_frames)
    hidden_layer = Dense(1024, activation="relu", name="hidden_layer")(encoded_sequence)

    encoding_input = Input(shape=(784,), name="Encoding", dtype="float32")
    sentence_features = Dense(units=60, name="sentence_features")(encoding_input)
    x = concatenate([sentence_features, hidden_layer])
    outputs = Dense(10, activation="softmax")(x)

    model = Model([video, encoding_input], outputs)  # <=== double input
    model.summary()
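    With two inputs, `fit` and `predict` must receive a list of arrays in the same order as the inputs list passed to `Model`, with matching batch dimensions. A minimal NumPy sketch of the expected shapes (the batch size of 8 and the random dummy data are assumptions for illustration, not part of the original post):

    ```python
    import numpy as np

    batch, frames, rows, columns, channels = 8, 5, 224, 224, 3

    # Dummy batch matching the "video" input: (batch, frames, rows, columns, channels)
    video_batch = np.random.rand(batch, frames, rows, columns, channels).astype("float32")

    # Dummy batch matching the "Encoding" input: (batch, 784)
    feature_batch = np.random.rand(batch, 784).astype("float32")

    # One-hot labels for the 10-way softmax output: (batch, 10)
    labels = np.eye(10)[np.random.randint(0, 10, size=batch)].astype("float32")

    # Order must match Model([video, encoding_input], outputs):
    # model.compile(optimizer="adam", loss="categorical_crossentropy")
    # model.fit([video_batch, feature_batch], labels, epochs=1)

    print(video_batch.shape, feature_batch.shape, labels.shape)
    ```

    A dict keyed by input name also works, e.g. `model.fit({"input_1": video_batch, "Encoding": feature_batch}, labels)`, which avoids ordering mistakes (the video input's name depends on how it was declared).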