tensorflow keras conv-neural-network python-3.5 loss-function

Accuracy of contrastive loss function increases on training set, but validation accuracy gets worse or doesn't improve

I am trying to create a speaker recognition siamese neural network which takes two samples as input and figures out whether they are from the same speaker or not. To this end, I am using the contrastive loss function as described in a few sources I've checked (here and here).

I have a toy dataset which I trained a small model on (9500 training samples and 500 test samples). Training accuracy increases up to 0.97, while validation accuracy increases up to 0.93. So far so good. However, when I apply the same configuration to the bigger dataset I get poor results: training accuracy increases, but validation accuracy never exceeds 0.5, which is as good as random guessing for a problem like this. Here is my code:

import numpy as np
import keras
import tensorflow as tf
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate, Lambda, merge
from keras.layers import Dropout
from keras.layers import LSTM, BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras import backend as K
from keras.callbacks import ModelCheckpoint

K.set_image_dim_ordering('tf')

def Siamese_Contrastive_Loss():
    filepath = 'C:/Users/User/Documents/snet.h5'
    X_1, X_2, x1_val, x2_val, Y, val_y = data_preprocessing_load()
    input_shape = (sample_length, features, 1)
    left_input = Input(input_shape)
    right_input = Input(input_shape)

    baseNetwork = createBaseNetworkSmaller(sample_length, features, 1)
    encoded_l = baseNetwork(left_input)
    encoded_r = baseNetwork(right_input)
    distance = Lambda(euclidean_distance,output_shape=eucl_dist_output_shape)([encoded_l, encoded_r])
    model = Model([left_input, right_input], distance)

    checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
    callbacks_list = [checkpoint]
    model.compile(loss=contrastive_loss, optimizer='rmsprop', metrics=[acc])
    model.fit([X_1,X_2], Y, validation_data=([x1_val, x2_val],val_y), epochs=20, batch_size=32, verbose=2, callbacks=callbacks_list)


def data_preprocessing_load():
    ...
    return X_1, X_2, x1_val, x2_val, Y, val_y



def createBaseNetworkSmaller(sample_length, features, ii):
    input_shape = (sample_length, features, ii)
    baseNetwork = Sequential()
    baseNetwork.add(Conv2D(64,(10,10),activation='relu',input_shape=input_shape))
    baseNetwork.add(MaxPooling2D(pool_size=3))
    baseNetwork.add(Conv2D(64,(5,5),activation='relu'))
    baseNetwork.add(MaxPooling2D(pool_size=1))
    #baseNetwork.add(BatchNormalization())
    baseNetwork.add(Flatten())
    baseNetwork.add(Dense(32, activation='relu'))
    #baseNetwork.add(Dropout(0.2))
    baseNetwork.add(Dense(32, activation='relu'))
    return baseNetwork

def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True), K.epsilon()))


def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)


def contrastive_loss(y_true, y_pred):
    '''Contrastive loss from Hadsell-et-al.'06
    http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
    '''
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    # Variant for labels 1 = pull together, 0 = push apart:
    #return K.mean(y_true * square_pred + (1 - y_true) * margin_square)
    # Active variant for labels 0 = pull together, 1 = push apart:
    return K.mean((1 - y_true) * square_pred + y_true * margin_square)


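def acc(y_true, y_pred):
    # Rounds the distance to 0 or 1, flips it (small distance -> 1, large -> 0),
    # and compares the result with y_true.
    ones = K.ones_like(y_pred)
    return K.mean(K.equal(y_true, ones - K.clip(K.round(y_pred), 0, 1)), axis=-1)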

I think that the problem lies in the fact that I do not know exactly what contrastive loss is supposed to be doing. I have a subset of positive pairs (samples from the same speaker) marked as 0 and another subset of negative pairs (samples from different speakers) marked as 1. As I understand it, the idea is to try to maximize the distance between the negative pairs and minimize it between the positive ones. I am not sure if this is the case here. The function named 'acc' is determining the accuracy at each step of the training. The function named 'contrastive_loss' is the main loss function, where I've put two return statements, one being commented out. I've read in a forum that depending on how one has marked their positive and negative pairs (0/1 or 1/0 respectively) they should use the corresponding formula. At this point I am confused. What configuration should I use? Are the positive pairs supposed to be 0s and the negatives 1s or vice versa? And lastly, what should the contrastive loss be like?
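For reference, the two labelling conventions can be compared side by side in plain NumPy. This is only a sketch: the function names, the example distances, and the 0.5 threshold are made up for illustration, and the factor 1/2 from the Hadsell et al. paper is omitted, as in the code above. It suggests that both formulas give the same value as long as the labels match the formula, and a much larger value when they do not.

```python
import numpy as np

def contrastive_loss_zero_is_same(y_true, dist, margin=1.0):
    # Convention of Hadsell et al.: 0 = similar pair, 1 = dissimilar pair.
    pull = (1.0 - y_true) * np.square(dist)
    push = y_true * np.square(np.maximum(margin - dist, 0.0))
    return float(np.mean(pull + push))

def contrastive_loss_one_is_same(y_true, dist, margin=1.0):
    # Opposite convention: 1 = similar pair, 0 = dissimilar pair.
    pull = y_true * np.square(dist)
    push = (1.0 - y_true) * np.square(np.maximum(margin - dist, 0.0))
    return float(np.mean(pull + push))

def accuracy_zero_is_same(y_true, dist, threshold=0.5):
    # Predict 0 (same speaker) below the threshold, 1 (different) otherwise.
    y_pred = (dist >= threshold).astype(y_true.dtype)
    return float(np.mean(y_pred == y_true))

# Hypothetical distances: two close pairs, two distant pairs.
dist = np.array([0.1, 0.2, 0.9, 1.5])
labels_zero_same = np.array([0.0, 0.0, 1.0, 1.0])
labels_one_same = 1.0 - labels_zero_same

loss_a = contrastive_loss_zero_is_same(labels_zero_same, dist)
loss_b = contrastive_loss_one_is_same(labels_one_same, dist)
# Same labels fed to the formula written for the other convention.
loss_mismatch = contrastive_loss_one_is_same(labels_zero_same, dist)
acc_a = accuracy_zero_is_same(labels_zero_same, dist)

print(loss_a, loss_b, loss_mismatch, acc_a)
```

The same consistency applies to the metric: the expression `ones - K.clip(K.round(y_pred), 0, 1)` in `acc` maps small distances to 1, so it is worth checking that this agrees with how the pairs are labelled.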

Solution

How are the audio samples and the pair labels set up (0 for different speakers, 1 for the same speaker)? I would advise using a sigmoid activation in the last layer, with 2 neurons. That way you can use the "binary_crossentropy" loss function, and the output of your network will be a value between 0 and 1, where 0 means the greatest difference between the two audio samples and 1 the greatest similarity.

def createBaseNetworkSmaller(sample_length, features, ii):
    input_shape = (sample_length, features, ii)
    baseNetwork = Sequential()
    baseNetwork.add(Conv2D(64,(10,10),activation='relu',input_shape=input_shape))
    baseNetwork.add(MaxPooling2D(pool_size=3))
    baseNetwork.add(Conv2D(64,(5,5),activation='relu'))
    baseNetwork.add(MaxPooling2D(pool_size=1))
    #baseNetwork.add(BatchNormalization())
    baseNetwork.add(Flatten())
    baseNetwork.add(Dense(32, activation='relu'))
    #baseNetwork.add(Dropout(0.2))
    baseNetwork.add(Dense(2, activation='sigmoid'))
    return baseNetwork

model.compile(loss=contrastive_loss, optimizer='rmsprop', metrics=[acc])
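To make the binary cross-entropy mentioned above concrete, here is a small NumPy sketch of what that loss computes for a similarity score between 0 and 1. It assumes the labelling 1 = same speaker, 0 = different speakers; the helper name, the scores, and the `eps` clipping value are illustrative and not taken from the code above.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip the scores so that log(0) never occurs.
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Assumption: 1 = same speaker, 0 = different speakers.
labels = np.array([1.0, 1.0, 0.0, 0.0])

# Hypothetical similarity scores: high for same-speaker pairs, low otherwise.
good_scores = np.array([0.9, 0.8, 0.1, 0.2])
# The same scores inverted, i.e. a model that gets every pair wrong.
bad_scores = 1.0 - good_scores

loss_good = binary_crossentropy(labels, good_scores)
loss_bad = binary_crossentropy(labels, bad_scores)

print(loss_good, loss_bad)
```

The loss is small when high scores line up with label 1, and grows quickly when they do not, which is why the label convention has to match the meaning of the network output here as well.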