Tags: keras, neural-network, word2vec, similarity, word-embedding

Keras Semantic Similarity model from pre-trained embeddings


I want to implement a Keras model to predict the similarity between two sentences from word embeddings as follows (I included my full script at the end):

  1. Load word embedding models, e.g., Word2Vec and fastText.
  2. Generate samples (X1 and X2) by computing the average word vectors for all words in a sentence. If two or more models are used, calculate the arithmetic mean of all embeddings (Frustratingly Easy Meta-Embedding -- Computing Meta-Embeddings by Averaging Source Word Embeddings).
  3. Concatenate X1 and X2 into one array before feeding them to the network.
  4. Compile (and evaluate) the Keras model.

The entire script is as follows:

import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split


def encoder_vector(v: str, model: Word2Vec) -> np.ndarray:
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)


def encoder_words_avg(words: list[str], model: Word2Vec) -> np.ndarray:
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)


def load_samples(mappings, w2v_model, fast_model):
    dim = w2v_model.vector_size
    num = len(mappings)

    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))

    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]

        e = np.zeros((2, dim))

        # Compute meta-embedding by averaging all embeddings.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)

        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)

        y[i] = 0.0 if mapping[0].startswith("-") else 1.0

    return X1, X2, y


def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)

    # Evaluate the trained model, using the train and test data
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)

    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))

    return model


def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")

    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]

    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)

    # Concatenate both arrays into one before feeding to the network.
    X = np.concatenate([X1, X2], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = baseline_model(X_train, X_test, y_train, y_test)

    model.summary()


if __name__ == "__main__":
    main()

The above script seems to work, but the prediction results are very poor, even when using only Word2Vec (which makes me think there could be an issue with the Keras model...). Any ideas on how to improve the outcome? Am I doing something wrong?

Thank you.


Solution

  • It's unclear what you're intending to predict.

    Do you want your Keras NN to report the same value that a precise cosine-similarity calculation between the two text-summary vectors would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
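
    For example, a minimal sketch of the direct calculation, reusing the encoder_words_avg helper and the already-loaded w2v_model from your script (the two sentences here are just placeholders):

    def cosine_similarity(a, b):
        # Plain cosine similarity; returns 0.0 if either vector is all zeros.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    v1 = encoder_words_avg(words="boiled chicken egg".split(), model=w2v_model)
    v2 = encoder_words_avg(words="hen egg whole boiled".split(), model=w2v_model)
    print(cosine_similarity(v1, v2))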

    Alternatively, if your tiny 6-pair dataset is the target:

    1. Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' seem about as similar as 'sweet potatoes' & 'potato chip'.

    2. A mere 6 examples (maybe 5 after the train/test split?) is far too few to usefully train a larger neural classifier, and to the extent the classifier can be easily trained (indeed, 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one held-out example (whose vectors are quite far from the training texts). With such a paucity of training data, and testing on inputs that may be arbitrarily different from the training data, "very poor" performance is to be expected. Neural nets require lots of varied training examples!

    Finally, the strategy of creating combined embeddings by averaging, as investigated by your linked paper, is another atypical practice that seems fishy to me. Even if it can offer some benefit, there's no reason to mix that non-intuitive extra step into your experiment before you have things working with a more typical, simpler baseline approach for comparison, so you can be sure the extra 'meta'/averaging is worth the complication.
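
    As a concrete sketch of that simpler baseline (the load_samples_single name is hypothetical; the mappings format and the encoder_words_avg helper are from your script): build the sentence vectors from a single model, with no meta-averaging, and feed the result to the exact same network.

    def load_samples_single(mappings, model):
        # Same as load_samples, but uses a single embedding model (no meta-averaging).
        dim = model.vector_size
        X1 = np.zeros((len(mappings), dim))
        X2 = np.zeros((len(mappings), dim))
        y = np.zeros((len(mappings), 1))
        for i, row in enumerate(mappings):
            label, sentence_1, sentence_2 = row.split("|")
            X1[i] = encoder_words_avg(words=sentence_1.split(), model=model)
            X2[i] = encoder_words_avg(words=sentence_2.split(), model=model)
            y[i] = 0.0 if label.startswith("-") else 1.0
        return X1, X2, y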

    The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent space) than averaging, except by a tiny amount in 1 of 6 tests. Further, the average of GloVe & CBOW performs the same as or worse than GloVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance might be mainly random jitter introduced by the extra steps, and that the averaging is, at best, a cheap option to consider for a tiny boost, not a generally better approach.
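
    If you do eventually want to combine two source models, the two options look like this for a single sentence (a sketch only; sentence stands in for any sentence string, and both models are assumed loaded as in your script):

    v_w2v = encoder_words_avg(words=sentence.split(), model=w2v_model)
    v_fast = encoder_words_avg(words=sentence.split(), model=fast_model)

    v_avg = np.mean([v_w2v, v_fast], axis=0)    # averaging, as in the paper
    v_concat = np.concatenate([v_w2v, v_fast])  # concatenation: keeps both spaces, doubles the dimensionality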

    The paper also leaves many natural related questions unaddressed:

    • Is averaging better than, say, just picking a random half of each model's dimensions for concatenation? That'd be even cheaper!
    • Might some of the slight lift on some tasks be due not to the averaging, but to the other transformations they've applied: the l2-normalization applied to each source model, or across the whole of each dimension for the GloVe model? (It's unclear whether this model post-processing was applied only before the dual-model averaging, or also to GloVe in its solo evaluation; see the quick sketch just below.)
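
    A small numpy sketch of those variants, using a stand-in matrix of word vectors (the shape and values are illustrative only):

    vectors = np.random.rand(1000, 300)  # stand-in for one model's (vocab_size, dim) word vectors

    # (a) l2-normalize each word vector (per row):
    per_vector = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    # (b) l2-normalize across the whole of each dimension (per column):
    per_dimension = vectors / np.linalg.norm(vectors, axis=0, keepdims=True)

    # (c) the even-cheaper strawman: keep a random half of the dimensions
    rng = np.random.default_rng(0)
    keep = rng.choice(vectors.shape[1], size=vectors.shape[1] // 2, replace=False)
    random_half = vectors[:, keep]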

    There's other work suggesting post-training transformations of word-vector spaces can improve performance on downstream tasks (see, for example, 'All But The Top'), so it's important to distinguish which steps, exactly, deliver which advantages.