Search code examples
python-3.xgensimfasttextoov

Cannot reproduce pre-trained word vectors from its vector_ngrams


Just curiosity, but I was debugging gensim's FastText code for replicating the implementation of Out-of-Vocabulary (OOV) words, and I'm not being able to accomplish it. So, the process i'm following is training a tiny model with a toy corpus, and then comparing the resulting vectors of a word in the vocabulary. That means if the whole process is OK, the output arrays should be the same.

Here is the code I've used for the test:

from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes

# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
             ['hey', 'hello', 'another', 'test']]

# Instatiate FastText gensim's class
ft = FastText(sg=1, size=5, min_count=1, \
window=2, hs=0, negative=20, \
seed=0, workers=1, bucket=100, \
min_n=3, max_n=4)

# Build vocab
ft.build_vocab(sentences)

# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)

# Save model
ft.save('./ft.model')
del ft

# Load model
ft = FastText.load('./ft.model')

# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash ngrams to its corresponding index, just as Gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
    word_vec += ft.wv.vectors_ngrams[nh]

# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))

The output of this script is False for every dimension of the compared arrays.

It would be nice if someone could point me out if i'm missing something or doing something wrong. Thanks in advance!


Solution

  • The calculation of a full word's FastText word-vector is not just the sum of its character n-gram vectors, but also a raw full-word vector that's also trained for in-vocabulary words.

    The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:

    https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282

    The raw full-word vectors are in a .vectors_vocab array on the model.wv object.

    (If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the ft_ngram_hashes() method of the library – if not, your manual ngram-list-creation and subsequent hashing may be doing something different.)