Tags: python, deep-learning, neural-network, pytorch, siamese-network

Need to implement a deep learning architecture quite similar to a Siamese network


I must implement this network:

[Figure 1: the network architecture to implement]

It is similar to a Siamese network with a contrastive loss. My problem is with F1/S1. The paper says the following:

"F1 and S1 are neural networks that we use to learn the unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict F1 and S1 in both training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully connected layers (green). ReLU non-linearity is used between all layers. The last layer is a unit-normalization layer (blue). For both face and speech modalities, F1 and S1 return 250-dimensional unit-normalized embeddings".

My question is:

  1. How can I apply a 2D convolutional layer (purple) to an input with shape (number of videos, number of frames, features)?
  2. What is the last layer? Batch norm? F.normalize?

Solution

  • I will answer your two questions without going into too much detail:

    1. If you're working with a CNN, your input most likely carries spatial information; that is, it is a two-dimensional multi-channel tensor of shape (*, channels, height, width), not a flat feature vector of shape (*, features). You simply won't be able to apply a convolution (at least a 2D one) to your input if you don't retain that two-dimensional structure. See the first sketch after this list.

    2. The last layer is described as a "unit-normalization" layer. This is simply the operation of scaling a vector so that its norm equals 1, which you can do by dividing the vector by its (L2) norm. See the second sketch after this list.
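
To illustrate the first point, here is a minimal sketch of feeding video frames to a 2D convolution. The shapes, layer sizes, and the idea of folding the frame dimension into the batch are my own assumptions for illustration, not the paper's architecture:

    import torch
    import torch.nn as nn

    # Hypothetical input: keep each frame as a (channels, height, width) image
    # instead of flattening it into a feature vector, so Conv2d has spatial
    # dimensions to operate on. All sizes below are made up for the example.
    num_videos, num_frames, channels, height, width = 4, 16, 3, 64, 64
    frames = torch.randn(num_videos, num_frames, channels, height, width)

    # Fold the frame dimension into the batch: every frame becomes one sample.
    x = frames.view(num_videos * num_frames, channels, height, width)

    conv = nn.Conv2d(in_channels=channels, out_channels=32, kernel_size=3, padding=1)
    pool = nn.MaxPool2d(kernel_size=2)

    out = pool(torch.relu(conv(x)))  # -> (num_videos * num_frames, 32, 32, 32)

    # Restore the (videos, frames, ...) split afterwards if you need it.
    out = out.view(num_videos, num_frames, *out.shape[1:])
    print(out.shape)  # torch.Size([4, 16, 32, 32, 32])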
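
For the second point, here is a short sketch of unit-normalizing a batch of embeddings in PyTorch. The 250-dimensional size comes from the paper; the tensor names and the use of F.normalize as a convenience are my own choices:

    import torch
    import torch.nn.functional as F

    embeddings = torch.randn(8, 250)  # e.g. a batch of F1/S1 outputs before normalization

    # Manual unit-normalization: divide each row by its L2 norm.
    manual = embeddings / embeddings.norm(p=2, dim=1, keepdim=True)

    # Built-in equivalent (also guards against division by zero with an eps).
    normalized = F.normalize(embeddings, p=2, dim=1)

    print(torch.allclose(manual, normalized))  # True
    print(normalized.norm(dim=1))              # all values ~1.0

Note that a BatchNorm layer would not do this: batch norm normalizes each feature across the batch using learned/running statistics, whereas the paper's unit-normalization rescales each embedding vector individually so its norm is exactly 1.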