Tags: python, tensorflow, keras

Combine two different shaped inputs in Tensorflow, combine images and landmark coordinates


I am currently stuck trying to combine two differently shaped inputs that I want to feed to my model.

What I have:

I have the following two inputs

X_train # shape (120, 224, 224, 1)
landmarks_x_train # shape (120, 478, 3)
X_val # shape (40, 224, 224, 1)
landmarks_x_val # shape (40, 478, 3)

So in this example I have 120 grayscale images of size (224, 224), and each image has one landmark “set” of 478 landmarks with x, y, z coordinates.

The number 120 is just an example, the real dataset has way more images and landmarks for each image.


As a model, I have built a ResNet50 by myself with input_shape=(224, 224, 1).

And the output layer is x = Dense(7, activation='softmax')(x).

Before I train the model, I create an ImageDataGenerator flow like:

datagen = ImageDataGenerator(horizontal_flip=True, fill_mode='nearest')
datagen.fit(X_train_with_landmarks)

batch_size = 16
train_flow = datagen.flow(X_train, y_train, batch_size=batch_size)
val_flow = datagen.flow(X_val, y_val, batch_size=batch_size)

My training steps are like:

model = ResNet.get_resnet_50_model() # my class where the model is located

optimizer = Adam(learning_rate=0.01)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

num_epochs = 5

history = model.fit(train_flow,
                    steps_per_epoch=len(X_train) // batch_size,
                    epochs=num_epochs,
                    verbose=2,
                    validation_data=val_flow,
                    validation_steps=len(X_val) // batch_size)

Where the problem is:

I now want to combine those two inputs to build a better model that doesn't rely on the images alone, as it does now.

I have tried several things I found on the web and also asked ChatGPT but without luck.

The most promising way was to combine the two with a Keras Concatenate layer, like this:

model = ResNet.get_resnet_50_model()

landmarks_input = Input(shape=(landmarks_x_train.shape[1],), name='landmarks_input')

model_output = model.output

combined_input = concatenate([model_output, landmarks_input], name='combined_input')

model = Model(inputs=[model.input, landmarks_input], outputs=combined_input)

This gave me a model, but I was unable to adapt the model.fit() process to get it running.

Conclusion:

So now I hope someone can help me combine those two inputs, so I can train the model on both of them.


Solution

  • In Keras, mixed data and multiple inputs can be integrated using the Keras functional API.

    From an architectural point of view, you introduce two input streams, pass each through its own dense layers, and then concatenate the two streams.

    # imports needed for the functional API
    from tensorflow.keras.layers import Input, Dense, concatenate
    from tensorflow.keras.models import Model

    # define two sets of inputs
    inputA = Input(shape=(32,))
    inputB = Input(shape=(128,))
    
    # the first branch operates on the first input
    x = Dense(8, activation="relu")(inputA)
    x = Dense(4, activation="relu")(x)
    x = Model(inputs=inputA, outputs=x)
    
    # the second branch operates on the second input
    y = Dense(64, activation="relu")(inputB)
    y = Dense(32, activation="relu")(y)
    y = Dense(4, activation="relu")(y)
    y = Model(inputs=inputB, outputs=y)
    
    # combine the output of the two branches
    combined = concatenate([x.output, y.output])
    
    # apply a FC layer and then a regression prediction on the
    # combined outputs
    z = Dense(2, activation="relu")(combined)
    z = Dense(1, activation="linear")(z)
    
    # our model will accept the inputs of the two branches and
    # then output a single value
    model = Model(inputs=[x.input, y.input], outputs=z)
    

    A full tutorial can be found here.
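Applied to the shapes in the question, the same pattern might look like the sketch below. The small Conv2D stack is only a stand-in for the custom ResNet50 (to adapt the real model, you would take a layer before its softmax head as the image branch); the layer sizes are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     GlobalAveragePooling2D, Flatten,
                                     Dense, concatenate)
from tensorflow.keras.models import Model

# Image branch -- a tiny stand-in for the custom ResNet50 from the question.
image_input = Input(shape=(224, 224, 1), name='image_input')
x = Conv2D(8, 3, activation='relu')(image_input)
x = MaxPooling2D(4)(x)
x = GlobalAveragePooling2D()(x)

# Landmark branch: (478, 3) coordinates, flattened into one feature vector.
landmarks_input = Input(shape=(478, 3), name='landmarks_input')
y = Flatten()(landmarks_input)
y = Dense(64, activation='relu')(y)

# Merge both streams, then classify into the 7 classes.
combined = concatenate([x, y])
z = Dense(32, activation='relu')(combined)
output = Dense(7, activation='softmax')(z)

model = Model(inputs=[image_input, landmarks_input], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# fit() then takes a list of arrays, one per Input layer:
# model.fit([X_train, landmarks_x_train], y_train, ...)
```

Note that ImageDataGenerator.flow only augments a single image array, so with two inputs you would either drop the generator or wrap it in a custom generator (or tf.data pipeline) that yields ([images, landmarks], labels) batches.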

    Note: If your landmark coordinates lie within the image dimensions, you could also add an extra channel to the image, where each pixel at a landmark location gets the associated depth / z value. Your input is then an image with two channels.
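A minimal sketch of that extra-channel idea, assuming the landmark x/y values are pixel coordinates (if they are normalized to [0, 1], scale them by the image size first); the helper name is made up for illustration:

```python
import numpy as np

def add_landmark_channel(images, landmarks):
    """Rasterize (x, y, z) landmarks into a second image channel.

    images:    (n, h, w, 1) array of grayscale images
    landmarks: (n, 478, 3) array of per-image landmark coordinates
    Returns a (n, h, w, 2) array.
    """
    n, h, w, _ = images.shape
    channel = np.zeros((n, h, w, 1), dtype=images.dtype)
    for i, pts in enumerate(landmarks):
        xs = np.clip(pts[:, 0].astype(int), 0, w - 1)
        ys = np.clip(pts[:, 1].astype(int), 0, h - 1)
        channel[i, ys, xs, 0] = pts[:, 2]  # store depth at each landmark pixel
    # stack the landmark channel behind the grayscale channel
    return np.concatenate([images, channel], axis=-1)
```

The model then uses input_shape=(224, 224, 2) and needs no second input branch, at the cost of losing sub-pixel landmark precision.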