
Siamese Network for binary classification with pre-encoded inputs


I want to train a Siamese Network to compare vectors for similarity.

My dataset consists of pairs of vectors and a target column that is "1" if the two vectors are the same and "0" otherwise (binary classification):

import pandas as pd

# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())

y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())

# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val

# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]

assert X_left_train.shape == X_right_train.shape

# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")

print(y_test.value_counts())

X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]

which outputs:

         v0        v1        v2  ...       v397      v398      v399  class
0  0.003615  0.013794  0.030388  ...  -0.093931  0.106202  0.034870    0.0
1  0.018988  0.056302  0.002915  ...  -0.007905  0.100859 -0.043529    0.0
2  0.072516  0.125697  0.111230  ...  -0.010007  0.064125 -0.085632    0.0
3  0.051016  0.066028  0.082519  ...   0.012677  0.043831 -0.073935    1.0
4  0.020367  0.026446  0.015681  ...   0.062367 -0.022781 -0.032091    0.0

1.0    1060
0.0     923
Name: class, dtype: int64

1.0     354
0.0     308
Name: class, dtype: int64

The rest of my script is as follows:

import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model


def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Epsilon is a small value that barely changes the result but ensures the argument of the square root is never exactly zero.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))


def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.

    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
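    # y_true == 1 marks a similar pair (the loss pulls the distance toward 0);
    # y_true == 0 marks a dissimilar pair (the loss pushes the distance past the margin).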
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))


def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))


def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input((input_dim,), name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)


def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)

    left_input = Input((input_dim,))
    right_input = Input((input_dim,))

    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)

    # The Euclidean distance layer outputs a value close to 0 when the two inputs are similar and a larger value when they are dissimilar.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])

    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])

    return siamese_net


model = build_siamese_model(X_left_train.shape[1])

es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)
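
A minimal sketch of how the test pairs could then be scored, assuming the same fixed 0.5 distance threshold as the accuracy metric above:

# Sketch: predict pair distances and threshold at 0.5 (matching the accuracy metric).
distances = model.predict([X_left_test, X_right_test])
y_pred = (distances.ravel() < 0.5).astype(int)  # 1 = "same", 0 = "different"
print("test accuracy:", (y_pred == y_test.values).mean())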

I have plotted the contrastive loss vs epoch and model accuracy vs epoch:

[plot: contrastive loss and accuracy vs. epoch, dropout = 0.1]

The validation line is almost flat, which seems odd to me (overfitted?).

After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:

[plot: contrastive loss and accuracy vs. epoch, dropout = 0.5]

It looks somewhat better, but still yields poor predictions.

My questions are:

  • Most examples of Siamese networks I've seen so far involve embedding layers (text pairs) and/or convolutional layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?

  • The output layer of my Siamese Network is as follows:

    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    

    but someone over the internet suggested this instead:

    distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
    output = Dense(1, activation="sigmoid")(distance)  # returns the class probability
    siamese_net = Model(inputs=[left_input, right_input], outputs=output)
    siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    

    I'm not sure which one to trust, nor do I fully understand the difference between them (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.

EDIT:

After following @PlzBePython's suggestions, I came up with the following output layers:

distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])

[plot: contrastive loss and accuracy vs. epoch for the updated model]

Thank you for your help!


Solution

  • This is less of an answer and more me writing down my thoughts in the hope that they help find one.


    In general, everything you do seems pretty reasonable to me. Regarding your questions:

    1:

    Embedding or feature extraction layers are never a must, but they almost always make it easier to learn the intended task. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes the model independent of a word's position. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
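
    For example, a wider base network could look like the following sketch (the layer sizes 512/256/128 are arbitrary assumptions, not tuned values):

    from keras.layers import Dense, Dropout, Input
    from keras.models import Model

    def create_larger_base_network(input_dim: int, dropout_rate: float = 0.1):
        """Purely illustrative: a wider/deeper variant of the question's base network."""
        inputs = Input((input_dim,))
        x = Dense(512, activation="relu")(inputs)
        x = Dropout(dropout_rate)(x)
        x = Dense(256, activation="relu")(x)
        x = Dropout(dropout_rate)(x)
        x = Dense(128, activation="relu", name="Embeddings")(x)
        return Model(inputs, x)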

    2:

    In my opinion, those two loss functions are not too different. Writing y for the true label, p for the predicted probability and d for the predicted distance, binary crossentropy is:

    BCE = -mean( y * log(p) + (1 - y) * log(1 - p) )

    while contrastive loss (margin = 1) is:

    L_contrastive = mean( y * d^2 + (1 - y) * max(1 - d, 0)^2 )

    So you basically swap a log function for a hinge function. The only real difference comes from the distance calculation. L1 distance was probably suggested to you because L2 distance is supposed to perform worse in higher dimensions (see for example here) and your embedding dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
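
    For concreteness, swapping the Euclidean distance for an L1 (Manhattan) distance in the question's setup could look like this sketch (the function name is mine; encoded_l / encoded_r are the tensors from build_siamese_model):

    import keras.backend as K
    from keras.layers import Lambda

    def manhattan_distance(vectors):
        """L1 distance between the two embeddings, kept above epsilon for numerical stability."""
        x, y = vectors
        return K.maximum(K.sum(K.abs(x - y), axis=1, keepdims=True), K.epsilon())

    # Drop-in replacement for the Euclidean distance layer:
    distance = Lambda(manhattan_distance, name="L1-Distance")([encoded_l, encoded_r])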


    What I would try is:

    • increase the margin parameter. A margin of 1 always results in a fairly low loss in the false positive case, which could slow down training in general (a sketch with a configurable margin follows this list)
    • try embedding into the (-inf, inf) space (change the activation of the last embedding layer to "linear")
    • change the "binary_crossentropy" loss into "keras.losses.BinaryCrossentropy(from_logits=True)" and the last activation from "sigmoid" to "linear". This should actually not make a difference, but I've had some odd experiences with the Keras binary crossentropy function, and from_logits seems to help sometimes (see the sketch after the EDIT below)
    • increase the number of parameters
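
    As a rough sketch of the first two points (the margin value of 5.0 is an arbitrary example, not a recommendation):

    import keras.backend as K

    def contrastive_loss_with_margin(margin: float = 5.0):
        """Contrastive loss with a configurable margin."""
        def loss(y_true, y_pred):
            return K.mean(
                y_true * K.square(y_pred)
                + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0))
            )
        return loss

    # In the question's build_siamese_model, the last embedding layer would become
    # Dense(dense_units, activation="linear", name="Embeddings"), and the compile call:
    # siamese_net.compile(loss=contrastive_loss_with_margin(margin=5.0),
    #                     optimizer="RMSprop", metrics=[accuracy])

    Note that with a larger margin, the fixed 0.5 threshold in the accuracy metric would likely need to be adjusted as well.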

    Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that by the time the validation accuracy is computed for the first epoch, the model has already done about 30 weight updates (991 training samples at the default batch_size of 32). That means that, especially in the first epoch, a validation accuracy higher than the training accuracy (which is averaged over the batches during training) is to be expected. This can also create the mistaken impression that the validation loss is decreasing faster than the training loss.

    EDIT

    I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.