Tags: tensorflow, machine-learning, keras, deep-learning, transformer-model

Model's predictions always 0


I have a training set of shape (1280, 100, 20, 4096) which I feed to a transformer-based model for binary classification (labels are either 0 or 1). This is a large amount of data that I'm struggling to handle (I've tried feeding it to the model in batches, but I'm not sure about the best approach; for now I've simply reduced it to (450, 100, 20, 4096), and any suggestion is appreciated). My problem at the moment is that no matter how many epochs I train for, the accuracy on the test set is always 67.5% (exactly the percentage of 0-labeled samples in it), and precision and recall on the test set are always 0%. I've tried normalizing my data before feeding it to the model:

    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training data only, then apply the same transform
    # to the test data (flatten to 2D for sklearn, then restore the shape)
    scaler = StandardScaler()
    train_data = scaler.fit_transform(train_data.reshape(-1, train_data.shape[-1])).reshape(train_data.shape)
    test_data = scaler.transform(test_data.reshape(-1, test_data.shape[-1])).reshape(test_data.shape)

but this didn't result in any improvement. The model I'm using is based on an encoder-only transformer:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 100, 20, 4096)]   0         
_________________________________________________________________
frame_position_embedding (Po (None, 100, 20, 4096)     8192000   
_________________________________________________________________
transformer_layer (Encoder)  (None, 100, 20, 4096)     134299652 
_________________________________________________________________
global_max_pooling (GlobalMa (None, 4096)              0         
_________________________________________________________________
dropout (Dropout)            (None, 4096)              0         
_________________________________________________________________
output (Dense)               (None, 1)                 4097      
=================================================================
Total params: 142,495,749
Trainable params: 142,495,749
Non-trainable params: 0
_________________________________________________________________

During training, I can see that loss, accuracy, precision and recall reach decent levels, but when I evaluate the model on the test set all these values are as I previously described:

Epoch 100/100
29/29 [==============================] - 90s 3s/step - loss: 0.0839 - accuracy: 0.9610 - recall: 0.9316 - precision: 0.9589
2024-02-06 12:38:38.815759: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 9175040000 exceeds 10% of free system memory.
9/9 [==============================] - 21s 2s/step - loss: 9.4117 - accuracy: 0.6750 - recall: 0.0000e+00 - precision: 0.0000e+00
Test accuracy: 67.5%
Test recall: 0.0%
Test precision: 0.0%

The optimizer is Adam, the loss is binary cross-entropy, and the output activation is sigmoid. I'm struggling to find an adequate tuning for the model, and even to understand its current behavior. In addition, it's not clear to me whether feeding the sets in batches, both to the scaler and to the fit function, would change the actual training of the model.
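For reference, the batched scaler variant I have in mind would rely on StandardScaler's partial_fit, something like this (the number of chunks is an arbitrary choice):

    import numpy as np

    # Illustrative sketch: accumulate the scaler's mean/variance over chunks
    # of the flattened training data instead of fitting on it all at once
    scaler = StandardScaler()
    flat = train_data.reshape(-1, train_data.shape[-1])
    for chunk in np.array_split(flat, 10):  # 10 chunks chosen arbitrarily
        scaler.partial_fit(chunk)
    train_data = scaler.transform(flat).reshape(train_data.shape)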


Solution

  • It sounds like the model is fitting the training set but not generalising to the test set. This is overfitting behaviour, which is likely here because you have very few samples relative to the model size (about 1,280 samples vs. roughly 142 million parameters).

    To help the net generalise better, without modifying your existing model too much, here are some ideas:

    1. Use AdamW rather than Adam, and set its weight_decay regularisation parameter to a larger-than-default value. This effectively shrinks the number of parameters the model can easily tap into (points 1 and 2 are shown together in the sketch after this list).

    2. Use a smaller batch size, which regularises the training. Start with 2 or 4, and experiment with doubling the batch size until you find a spot where training is reasonably quick and the metrics are still decent. Each time you change the batch size, consider re-tuning the learning rate; smaller batches usually pair with smaller learning rates.

    3. Reduce the size of the transformer layer. Also try reducing the embedding dimension (model size) of the embedding layer.

    4. Add dropout layers at each stage. Start with just one and monitor the change as you add more dropout layers.
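    A minimal sketch of points 1 and 2 together, assuming a recent TensorFlow version where tf.keras.optimizers.AdamW is available (the hyper-parameter values here are illustrative starting points, not tuned ones):

        import tensorflow as tf

        model.compile(
            optimizer=tf.keras.optimizers.AdamW(
                learning_rate=1e-4,  # small batches usually pair with a small LR
                weight_decay=1e-2,   # larger-than-default decay for stronger regularisation
            ),
            loss="binary_crossentropy",
            metrics=["accuracy",
                     tf.keras.metrics.Recall(name="recall"),
                     tf.keras.metrics.Precision(name="precision")],
        )
        model.fit(train_data, train_labels,
                  batch_size=4,          # point 2: small batch size
                  epochs=100,
                  validation_split=0.2)  # keep a validation split to monitor generalisation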

    It's usually worth re-tuning the learning rate after each change, especially if training becomes unstable.

    Using early stopping in conjunction with the above will likely also help (an example callback follows). I'd start with some of the points above first, though; otherwise the model might overfit almost immediately, leaving early stopping little to work with.
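    A minimal early-stopping setup with Keras's built-in callback (the monitored metric and patience value are illustrative):

        # Stop training once the validation loss stops improving,
        # and roll the weights back to the best epoch seen
        early_stop = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss",
            patience=10,
            restore_best_weights=True,
        )
        model.fit(train_data, train_labels, batch_size=4, epochs=100,
                  validation_split=0.2, callbacks=[early_stop])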

    It may also help to preprocess your data down to fewer/smaller features using sklearn or a neural network, as in the sketch below.
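    For instance, PCA could project the 4096-dimensional feature vectors down to a much smaller dimension (256 here is an arbitrary choice, and with this much data you may need sklearn's IncrementalPCA instead to stay within memory):

        from sklearn.decomposition import PCA

        # Flatten to (n_vectors, 4096), reduce the last axis, restore the rest
        flat_train = train_data.reshape(-1, train_data.shape[-1])
        pca = PCA(n_components=256)
        train_reduced = pca.fit_transform(flat_train).reshape(*train_data.shape[:-1], 256)

        flat_test = test_data.reshape(-1, test_data.shape[-1])
        test_reduced = pca.transform(flat_test).reshape(*test_data.shape[:-1], 256)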

    Try one thing at a time and observe its effect on the validation set metric(s). I wouldn't focus too much on the test set accuracy for now, because with this class imbalance it's biased in favour of the majority class. It's worth looking more at recall and precision initially, or at the F1 score, which combines them into a single number (see the snippet below).
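    F1 can be computed directly from the model's predictions, e.g. with sklearn (the 0.5 threshold is the usual default for a sigmoid output):

        from sklearn.metrics import f1_score

        # Threshold the sigmoid outputs into hard 0/1 predictions
        preds = (model.predict(test_data) > 0.5).astype("int32").ravel()
        print("F1:", f1_score(test_labels, preds))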

    Recall and precision should improve as the model becomes able to identify positive cases (recall) and to do so accurately (precision). The training metrics will drop at the same time; together, these trends mean the model is generalising better.