Keras Transformers - Dimensions must be equal

I wanted to do NER with a Keras model using transformers. The example was working correctly, but I wanted to add some context to each word to help the model be more accurate. By context I mean "coordinate X", "coordinate Y", "width of the word", "height of the word", "page index", ... For example, some information is usually in the top right corner of a document, so having the coordinates of the word might help (I'm new to ML, so feel free to tell me if I'm wrong).

To provide this "context", I've transformed x_train and x_val into this format:

[
    [
        [pageIndex, wordVocabId, x, y, width, height, ocrScore],
        [pageIndex, wordVocabId, x, y, width, height, ocrScore],
        ...
    ],
    [
        [pageIndex, wordVocabId, x, y, width, height, ocrScore],
        [pageIndex, wordVocabId, x, y, width, height, ocrScore],
        ...
    ],
    ...
]

Each second-level array represents a document, and each third-level array represents a word with its context. The third-level array is a NumPy array of numbers.
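As a sketch of this layout, the following NumPy snippet builds arrays with the same structure. It assumes every document is padded to the same number of words; the sizes and random values are placeholders for illustration only (516 words per document is taken from the error message further down), not the real data.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
num_docs, max_words, num_features = 4, 516, 7

rng = np.random.default_rng(0)

# One row per word: [pageIndex, wordVocabId, x, y, width, height, ocrScore]
x_train = rng.random((num_docs, max_words, num_features)).astype("float32")

# One tag id per word (3 tags, as in the question).
y_train = rng.integers(0, 3, size=(num_docs, max_words))

print(x_train.shape)  # (4, 516, 7)
print(y_train.shape)  # (4, 516)
```

Note that the inputs have one more axis than the labels: 7 values per word versus a single tag id per word.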

Even though I tried to edit the model to make it work, I don't think I went in the right direction, so I'll post here the model from the Keras example that I'm trying to use and would like to adapt to my use case:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers


    class TransformerBlock(layers.Layer):
        def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
            super().__init__()
            self.att = keras.layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=embed_dim
            )
            self.ffn = keras.Sequential(
                [
                    keras.layers.Dense(ff_dim, activation="relu"),
                    keras.layers.Dense(embed_dim),
                ]
            )
            self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)
            self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)
            self.dropout1 = keras.layers.Dropout(rate)
            self.dropout2 = keras.layers.Dropout(rate)

        def call(self, inputs, training=False):
            attn_output = self.att(inputs, inputs)
            attn_output = self.dropout1(attn_output, training=training)
            out1 = self.layernorm1(inputs + attn_output)
            ffn_output = self.ffn(out1)
            ffn_output = self.dropout2(ffn_output, training=training)
            return self.layernorm2(out1 + ffn_output)
        

    class TokenAndPositionEmbedding(layers.Layer):
        def __init__(self, maxlen, vocab_size, embed_dim):
            super().__init__()
            self.token_emb = keras.layers.Embedding(
                input_dim=vocab_size, output_dim=embed_dim
            )
            self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

        def call(self, inputs):
            maxlen = tf.shape(inputs)[-1]
            positions = tf.range(start=0, limit=maxlen, delta=1)
            position_embeddings = self.pos_emb(positions)
            token_embeddings = self.token_emb(inputs)
            return token_embeddings + position_embeddings

    class NERModel(keras.Model):
        def __init__(
            self, num_tags, vocab_size, maxlen=128, embed_dim=32, num_heads=2, ff_dim=32
        ):
            super().__init__()
            self.embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
            self.transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
            self.dropout1 = layers.Dropout(0.1)
            self.ff = layers.Dense(ff_dim, activation="relu")
            self.dropout2 = layers.Dropout(0.1)
            self.ff_final = layers.Dense(num_tags, activation="softmax")

        def call(self, inputs, training=False):
            x = self.embedding_layer(inputs)
            x = self.transformer_block(x)
            x = self.dropout1(x, training=training)
            x = self.ff(x)
            x = self.dropout2(x, training=training)
            x = self.ff_final(x)
            return x

Source: https://keras.io/examples/nlp/ner_transformers/

I compile and fit the model this way:

    print(len(tag_mapping), vocab_size, len(x_train), len(y_train))
    model = NERModel(len(tag_mapping), vocab_size, embed_dim=32, num_heads=4, ff_dim=64)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(tf.convert_to_tensor(x_train), tf.convert_to_tensor(y_train), validation_data=(x_val, y_val), epochs=10)
    model.save("model.keras")

The print outputs the following (I only have 3 tags for now because I'm first trying to get the model working):

3 20000 1000 1000

The format of my y_train is as follows:

[
    [tagId_document1_word1, tagId_document1_word2, ...],
    [tagId_document2_word1, tagId_document2_word2, ...]
]

When I run model.fit I get this error:

 ValueError: Dimensions must be equal, but are 516 and 7 for '{{node Equal}} = Equal[T=DT_FLOAT, incompatible_shape_error=true](Cast_1, Cast_2)' with input shapes: [?,516], [?,516,7].

I hope that with all this information someone can point me in the right direction, because I'm a bit lost here.

Thank you.

Solution

Training a model on batches of data relies on matrix multiplication.

From the error it seems that there is a mismatch between matrix shapes.

For two matrices to be multiplied, their inner dimensions must match.

The golden rule:

The number of columns in the first matrix (A) must be equal to the number of rows in the second matrix (B).

For example:

Matrix A shape = (7, 512)
Matrix B shape = (512, 8)

That means the output will be a 7x8 matrix.
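The rule can be checked quickly with NumPy. This is only a small sketch using the shapes from the example; the arrays are filled with ones as placeholders.

```python
import numpy as np

a = np.ones((7, 512))   # Matrix A: 7 rows, 512 columns
b = np.ones((512, 8))   # Matrix B: 512 rows, 8 columns

# Columns of A (512) match rows of B (512), so the product is defined.
c = a @ b
print(c.shape)  # (7, 8)
```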

They also need to have the same number of dimensions.

Matrix A shape = (7, 512)
Matrix B shape = (512, 7, 3)

So these will not be compatible, because there is an extra dimension in Matrix B.
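Trying this pair in NumPy shows the failure. This is a sketch with placeholder arrays; `np.matmul` treats B as a stack of 512 matrices of shape (7, 3), so the 512 columns of A no longer line up with anything.

```python
import numpy as np

a = np.ones((7, 512))
b = np.ones((512, 7, 3))   # extra dimension compared to A

try:
    np.matmul(a, b)
    failed = False
except ValueError as err:
    # NumPy rejects the product because the inner dimensions differ.
    failed = True
    print("shape mismatch:", err)
```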

Review the tensors you have: according to the error, one has shape (?, 516) and the other (?, 516, 7). The extra dimension of size 7 matches the 7 values per word in your input.
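One plausible reading of the error, offered as an assumption rather than a confirmed diagnosis: the failing op is `Equal`, which is what an accuracy metric does when it compares the labels with the arg-max of the predictions. If the whole (?, 516, 7) input goes through the embedding, the predictions keep that extra axis of size 7. The sketch below traces the shapes with NumPy stand-ins for the embedding and final dense layer (the transformer block is left out, and `vocab_size` is a small hypothetical value); it is not the Keras code itself.

```python
import numpy as np

batch, max_words, num_features, embed_dim, num_tags = 2, 516, 7, 32, 3
vocab_size = 100  # hypothetical, kept small for the sketch

rng = np.random.default_rng(0)
x = rng.integers(0, vocab_size, size=(batch, max_words, num_features))
y_true = rng.integers(0, num_tags, size=(batch, max_words))

# Stand-in for an embedding layer: a lookup appends an axis of size embed_dim.
table = rng.random((vocab_size, embed_dim))
embedded = table[x]                       # shape (2, 516, 7, 32)

# Stand-in for the final dense layer: it acts on the last axis only.
w_out = rng.random((embed_dim, num_tags))
y_pred = (embedded @ w_out).argmax(axis=-1)   # shape (2, 516, 7)

try:
    np.equal(y_true, y_pred)              # (2, 516) vs (2, 516, 7)
    mismatch = False
except ValueError:
    mismatch = True

# Feeding only the wordVocabId column (index 1) removes the extra axis.
ids_only = x[:, :, 1]                     # shape (2, 516)
y_pred_ids = (table[ids_only] @ w_out).argmax(axis=-1)

print(mismatch, y_pred_ids.shape == y_true.shape)
```

Selecting only the word id column discards the six other values, so if you want the model to use them they would need their own path into the network (for example, a projection combined with the token embedding). That design choice is left open here.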