I want to implement a Keras model to predict the similarity between two sentences from word embeddings as follows (I included my full script at the end):

- Generate the sentence embeddings (X1 and X2) by computing the average word vectors for all words in a sentence. If two or more models are used, calculate the arithmetic mean of all embeddings (Frustratingly Easy Meta-Embedding -- Computing Meta-Embeddings by Averaging Source Word Embeddings).
- Concatenate X1 and X2 into one array before feeding them to the network.

The entire script is as follows:
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split


def encoder_vector(v: str, model: Word2Vec) -> np.ndarray:
    """Return the embedding of a single word, or a zero vector if it is out of vocabulary."""
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)


def encoder_words_avg(words: list[str], model: Word2Vec) -> np.ndarray:
    """Average the embeddings of all in-vocabulary words in a sentence."""
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)


def load_samples(mappings, w2v_model, fast_model):
    """Build the sentence-embedding arrays X1, X2 and the label vector y from the mappings."""
    dim = w2v_model.vector_size
    num = len(mappings)

    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))

    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]

        e = np.zeros((2, dim))

        # Compute the meta-embedding by averaging the embeddings from both models.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)

        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)

        # A leading "-" in the id marks a dissimilar pair (label 0), otherwise similar (label 1).
        y[i] = 0.0 if mapping[0].startswith("-") else 1.0

    return X1, X2, y


def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)

    # Evaluate the trained model on both the train and the test data.
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))

    return model


def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")

    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]

    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)

    # Concatenate both arrays into one before feeding them to the network.
    X = np.concatenate([X1, X2], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = baseline_model(X_train, X_test, y_train, y_test)
    model.summary()


if __name__ == "__main__":
    main()
The above script seems to work, but the prediction result is very poor even when using only Word2Vec (which makes me think there could be an issue with the Keras model...). Any ideas on how to improve the outcome? Am I doing something wrong?
Thank you.
It's unclear what you're intending to predict.
Do you want your Keras NN to report the same value that a precise cosine-similarity calculation between the two text summary vectors would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
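For example, a minimal sketch of that direct calculation (a hypothetical helper, assuming X1 and X2 are the per-sentence average vectors that your load_samples already returns):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Plain cosine similarity between two 1-D vectors, 0.0 if either is all-zero.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    # Score each sentence pair directly; no training step is required.
    scores = [cosine_similarity(v1, v2) for v1, v2 in zip(X1, X2)]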
Alternatively, if your tiny 6-pair dataset is the target:

- Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' seem about as similar as 'sweet potatoes' & 'potato chip'.
- A mere 6 examples (maybe 5 after the train/test split?) is inadequate to usefully train a larger neural classifier, and to the extent the classifier might be easily trained (indeed, 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one held-out example (which uses vectors quite far from the training texts). With such a paucity of training data, and test inputs that may be arbitrarily different from the training data, "very poor" performance is to be expected. Neural nets require lots of varied training examples!
Finally, the strategy of creating combined embeddings by averaging, as investigated in your linked paper, is another atypical practice that seems fishy to me. Even if it could offer some benefit, there's no reason to mix that atypical, somewhat non-intuitive extra practice into your experiment before things work with a more typical, simpler baseline approach for comparison, to be sure the extra 'meta'/averaging is worth the complication.
The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent spaces) than averaging, except by a tiny amount in 1 of 6 tests. Further, the average of GloVe & CBOW performs the same as or worse than GloVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance may be mainly random jitter introduced by the extra steps, and that averaging is, at best, a cheap option to consider for a tiny boost, not a generally better approach.
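If you do want to compare, concatenation is only a small change from the averaging in your load_samples. A sketch under the assumptions of your script (same encoder_words_avg helper, both models loaded), with X1 allocated at the combined dimensionality rather than dim:

    # Keep both models' sentence vectors by concatenating them instead of averaging.
    # The meta-embedding then has w2v_model.vector_size + fast_model.vector_size dimensions.
    e_w2v = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
    e_fast = encoder_words_avg(words=sentence_1.split(), model=fast_model)
    X1[i] = np.concatenate([e_w2v, e_fast])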
The paper also leaves many natural related questions unaddressed:
- There's other work suggesting that post-training transformations of word-vector spaces can improve performance on downstream tasks (see, for example, 'All But The Top'), so it's important to distinguish which steps, exactly, get which advantages; a rough sketch of that kind of post-processing follows below.
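For reference, a minimal sketch of an 'All But The Top' style post-processing (subtract the mean, then remove the projections onto the top principal components); the n_components value here is an illustrative assumption, the paper's rule of thumb being roughly vector_size / 100:

    import numpy as np
    from sklearn.decomposition import PCA

    def all_but_the_top(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
        # 1. Centre the vectors by subtracting the mean vector.
        mean = vectors.mean(axis=0)
        centered = vectors - mean
        # 2. Find the dominant principal components of the centered vectors.
        pca = PCA(n_components=n_components)
        pca.fit(centered)
        # 3. Remove the projections onto those dominant directions.
        projections = centered @ pca.components_.T @ pca.components_
        return centered - projections

    # e.g. post-process the concatenated sentence features before training:
    # X = all_but_the_top(X, n_components=3)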