
Multi-instance classification using a transformer model


I use the transformer from this Keras documentation example for multi-instance classification. The class of each instance depends on the other instances that come in the same bag. I use a transformer model because:

It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects

For example, each bag may have at most 5 instances, with 3 features per instance.

import numpy as np

# Generate data
max_length = 5
x_lst = []
y_lst = []
for _ in range(10):
    num_instances = np.random.randint(2, max_length + 1)
    x_bag = np.random.randint(0, 9, size=(num_instances, 3))
    y_bag = np.random.randint(0, 2, size=num_instances)
    
    x_lst.append(x_bag)
    y_lst.append(y_bag)

Features and labels of first 2 bags (with 5 and 2 instances):

x_lst[:2]

[array([[8, 0, 3],
        [8, 1, 0],
        [4, 6, 8],
        [1, 6, 4],
        [7, 4, 6]]),
 array([[5, 8, 4],
        [2, 1, 1]])]

y_lst[:2]

[array([0, 1, 1, 1, 0]), array([0, 0])]

Next, I pad features with zeros and targets with -1:

x_padded = []
y_padded = []

for x, y in zip(x_lst, y_lst):
    x_p = np.zeros((max_length, 3))
    x_p[:x.shape[0], :x.shape[1]] = x
    x_padded.append(x_p)

    y_p = np.negative(np.ones(max_length))
    y_p[:y.shape[0]] = y
    y_padded.append(y_p)

X = np.stack(x_padded)
y = np.stack(y_padded)

so that X.shape is (10, 5, 3) and y.shape is (10, 5).
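
A quick sanity check of the padded arrays (expected output, based on the first two bags above, shown as comments):

print(X.shape, y.shape)   # (10, 5, 3) (10, 5)
print(y[1])               # [ 0.  0. -1. -1. -1.] -> 2 real labels, 3 padded positions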

I made two changes to the original model: I added a Masking layer after the Input layer and set the number of units in the last Dense layer to the maximum bag size (with a 'sigmoid' activation):

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Attention and Normalization
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs

    # Feed Forward Part
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res

def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    x = keras.layers.Masking(mask_value=0)(inputs) # ADDED MASKING LAYER
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)

    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(5, activation='sigmoid')(x) # CHANGED ACCORDING TO MY OUTPUT
    return keras.Model(inputs, outputs)

input_shape = (5, 3)

model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)

model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],
)
model.summary()

It looks like my model doesn't learn much. If I use the number of true values in each bag (y.sum(axis=1) with Dense(1)) as the target, instead of classifying each instance, the model learns well. Where is my error? How should I build the output layer in this case? Do I need a custom loss function?

UPDATE: I made a custom loss function:

def my_loss_fn(y_true, y_pred):
    # 1 for real labels, 0 for the -1 padding
    mask = tf.cast(tf.math.not_equal(y_true, tf.constant(-1.)), tf.float32)
    # expand to (batch, max_length, 1) so the crossentropy is computed per instance
    y_true, y_pred = tf.expand_dims(y_true, axis=-1), tf.expand_dims(y_pred, axis=-1)
    bce = tf.keras.losses.BinaryCrossentropy(reduction='none')
    # sum the loss only over the non-padded positions
    return tf.reduce_sum(tf.cast(bce(y_true, y_pred), tf.float32) * mask)
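
To use it, I pass the custom loss to model.compile in place of binary_crossentropy (a minimal sketch; note the built-in binary_accuracy metric still counts the padded positions):

model.compile(
    loss=my_loss_fn,
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],  # still includes the -1 padding
)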

To compare the number of real labels per bag with the number of predicted positives (y_test and preds come from a held-out split, e.g. preds = model.predict(X_test)):

import pandas as pd

mask = (y_test != -1).astype(int)
pd.DataFrame({'n_labels': mask.sum(axis=1), 'preds': ((preds * mask) >= .5).sum(axis=1)}).plot(figsize=(20, 5))

And it looks like the model learns (training plot not shown).

But it predicts all non-masked labels as 1 (prediction plot not shown).

@thushv89 This is my problem: I take two time points, t1 and t2, and look for all vehicles that are in maintenance at time t1 and all vehicles that are planned to be in maintenance at time t2. These form my bag of items. Then I calculate features such as how much time the t1 vehicles have already spent in maintenance, how much time remains from t1 until the planned start for the t2 vehicles, etc. My model learns well if I try to predict the number of vehicles in maintenance at time t2, but I would like to predict which of them will leave and which of them will come in (3 vs. [True, False, True, True] for 4 vehicles in the bag).


Solution

  • There are three important improvements (a combined sketch follows after the list):

    1. Replace the GlobalAveragePooling1D layer with a Flatten layer.
    2. Add a custom loss function that excludes the target padding from the calculation (already added to my question), and a custom metric function if you want to see the real metric.
    3. Pass an attention_mask to MultiHeadAttention (instead of using a Masking layer) to mask the padding.
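
A minimal sketch of how these three changes could fit together, assuming tf.keras (TensorFlow 2.x). The attention-mask construction and the helper names masked_bce and masked_binary_accuracy are illustrative additions, not code from the original answer:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0, attention_mask=None):
    # Same encoder block as in the question, extended with an attention_mask argument (change 3)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs, attention_mask=attention_mask)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs

    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res

def build_model(input_shape, head_size, num_heads, ff_dim,
                num_transformer_blocks, mlp_units, dropout=0, mlp_dropout=0):
    inputs = keras.Input(shape=input_shape)

    # Change 3: derive a (batch, seq, seq) attention mask from the zero-padded instances.
    # Caveat: a real instance whose features are all exactly zero would also be masked.
    not_padded = tf.reduce_any(tf.not_equal(inputs, 0.0), axis=-1)          # (batch, seq)
    attention_mask = tf.logical_and(tf.expand_dims(not_padded, axis=2),
                                    tf.expand_dims(not_padded, axis=1))     # (batch, seq, seq)

    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout,
                                attention_mask=attention_mask)

    # Change 1: Flatten instead of GlobalAveragePooling1D, so per-instance information survives
    x = layers.Flatten()(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(input_shape[0], activation="sigmoid")(x)  # one sigmoid per instance slot
    return keras.Model(inputs, outputs)

# Change 2: loss and metric that ignore the -1 padding in the targets
def masked_bce(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    y_true = tf.clip_by_value(y_true, 0.0, 1.0)          # padded -1 -> 0; masked out below anyway
    bce = keras.backend.binary_crossentropy(y_true, y_pred)
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

def masked_binary_accuracy(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    matches = tf.cast(tf.equal(tf.clip_by_value(y_true, 0.0, 1.0),
                               tf.cast(y_pred >= 0.5, tf.float32)), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

model = build_model((5, 3), head_size=256, num_heads=4, ff_dim=4,
                    num_transformer_blocks=4, mlp_units=[128],
                    mlp_dropout=0.4, dropout=0.25)
model.compile(loss=masked_bce,
              optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              metrics=[masked_binary_accuracy])
# model.fit(X, y, epochs=..., batch_size=...)  # X, y as built in the question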