Tags: python, scikit-learn, nlp, gensim, word2vec

Fine-tune a custom word2vec model with gensim 4


I am new to gensim, especially gensim 4. To be honest, I found it quite hard to work out from the docs how to fine-tune a pre-trained word2vec model. I have a binary pre-trained model saved locally, and I would like to fine-tune this model on new data.

My questions are:

  • How do I create the vocabulary by merging both vocabs?
  • Is this the correct approach to fine-tune a word2vec model?

So far I have created the following code:

import numpy as np
import pandas as pd

from gensim.models import Word2Vec, KeyedVectors

# path to pretrained model
pretrained_path = '../models/german.model'

# new data
sentences = df.stem_token_wo_sw.to_list() # Pandas column containing text data

# Create new model (min_count, vector_size, window and workers are set earlier)
w2v_de = Word2Vec(
    min_count = min_count,
    vector_size = vector_size,
    window = window,
    workers = workers,
)

# Build vocab
w2v_de.build_vocab(sentences)

# Extract number of examples
total_examples = w2v_de.corpus_count

# Load pretrained model
model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

# Add previous words from pretrained model
w2v_de.build_vocab([list(model.key_to_index.keys())], update=True)

# Train model
w2v_de.train(sentences, total_examples=total_examples, epochs=2)

# create array of vectors
vectors = np.asarray(w2v_de.wv.vectors)
# create array of labels
labels = np.asarray(w2v_de.wv.index_to_key) 

# create dataframe of vectors for each word
w_emb = pd.DataFrame(
    index = labels,
    columns = [f'X{n}' for n in range(1, vectors.shape[1] + 1)],
    data = vectors,
)

After training, I use PCA to reduce the dimensions from 300 to two in order to plot the word-embedding space.

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# create pipeline
pipeline = Pipeline(
    steps = [
        # ('scaler', StandardScaler()),
        ('pca', PCA(n_components=2)),
    ]
)

# fit pipeline
pipeline.fit(w_emb)

# Transform vectors
vectors_transformed = pipeline.transform(w_emb)

w_emb_transformed = (
    pd.DataFrame(
        index = labels,
        columns = ['PC1', 'PC2'],
        data = vectors_transformed,
    )
)

The labels and vectors only contain the new words, not the old + new words, and so do my plot and the PCA values.


Solution

  • There are no official Gensim docs on how to fine-tune a Word2Vec model, because there's no well-established/reliable way to do fine-tuning & be sure it's helping.

    There's thus no direct support in Gensim, nor a standard recipe that Gensim could recommend to non-expert users.

    People have patched together approaches, reaching into Gensim steps/models directly, to try to accomplish fine-tuning. But the average quality of such write-ups that I've seen is very poor, with little evaluation of whether the steps are working, or discussion of the tradeoffs and considerations when expanding beyond the write-up's toy setup.

    That is: they're often misleading the unaware into thinking this is a well-established process, with dependable results, when it's not.

    Regarding your process, some comments:

    • Your initial creation of a vocabulary will get all of your corpus's words into the model, with accurate frequency counts based on your corpus. (Frequencies affect how a model does negative-sampling & frequent-word downsampling, and which words get ignored entirely because they appear fewer times than the configured min_count.)
    • You are then successfully requesting that the model's vocabulary expand with the .build_vocab(..., update=True) call – but by providing a mere list of the pretrained model's words as if it were a new corpus, every word gets an effective occurrence count of just 1. With sensible values of min_count (such as the default 5, or higher when your corpus is large enough), none of those words from the pre-trained model will be added to the vocabulary.
    • But even if you did fix this step – either by setting min_count unwisely low, or by artificially repeating the words – the build_vocab() step only makes a slot for each word & randomly initializes its vector to ready the word for training. You're not doing anything to copy the actual vectors from the pretrained model into w2v_de (see the sketch just after these bullets). So all those 'borrowed' words will just be untrained noise in your actual model. And these words don't have accurate frequency counts to participate properly in training.
    • When you train on just your corpus, only your local-corpus words will appear in the training texts, and thus in the positive word-to-word training examples. But some of the imported words (if any made it in) will occasionally be chosen as negative examples, if you're using negative-sampling mode. (They won't be chosen at the typical frequencies, though, because of the lack of frequency info.) So you'll have a weird training run: primarily updating only your corpus's words, occasionally applying negative-example updates to the other words (but never positive updates). The randomly-initialized imported words will thus be skewed further, but not in any useful way.
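
    To make that missing vector-copy step concrete, here's a minimal sketch – not an endorsement of fine-tuning, and not an official Gensim recipe – of the explicit copy that patched-together approaches typically add after the vocabulary expansion. It assumes your vector_size matches the pretrained model's dimensionality (300 here), and even with this copy the frequency-count problems described above remain:

    # Sketch only: copy pretrained vectors over the randomly-initialized ones
    # for every word the new model's vocabulary shares with the pretrained model.
    from gensim.models import KeyedVectors

    pretrained = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

    copied = 0
    for word, idx in w2v_de.wv.key_to_index.items():
        if word in pretrained.key_to_index:
            # overwrite the random starting vector with the pretrained one
            w2v_de.wv.vectors[idx] = pretrained[word]
            copied += 1

    print(f'copied {copied} pretrained vectors into the new model')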

    At the end, you might have passable vectors for your in-corpus words. (Though: epochs=2 is unlikely to be sufficient training unless your corpus is so very large that every word of interest appears in many, many diverse contexts.) But the words you tried to import will have just junk vectors, having been initialized randomly, never influenced in their weights by your pretrained model at all, just skewed a bit by sometimes appearing as negative examples.

    In short: a mess, with the extra non-standard steps attempting fine-tuning doing nothing useful. (If you've copied this pattern faithfully from an online resource, that resource may have been written by an author who didn't know what they were doing.)

    A far surer approach, if you find your corpus is missing words, is to obtain a larger corpus. As one example, if your pretrained vectors were trained on something like Wikipedia, you can just mix your corpus with Wikipedia texts, to have a combined corpus with good usage examples of all the same words. (In some cases, you might be able to find corpus-extending materials that are more appropriate for your project/domain than generic Wikipedia reference text. Alternatively, you might choose to interleave & repeat your corpus, to essentially give your texts greater weight in the combined corpus.)

    A straightforward from-scratch training-run on this new extended corpus will co-train all words in the same model, with accurate counts matching words' appearances in the combined corpus.
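
    As a rough illustration of that, assuming you've collected extra tokenized texts into a hypothetical wiki_sentences list to mix with your own sentences (the repetition factor is just one way of weighting your own texts more heavily):

    # Sketch: one from-scratch training run on the combined corpus
    from gensim.models import Word2Vec

    combined_corpus = (sentences * 3) + wiki_sentences  # repeat your texts for extra weight

    w2v_combined = Word2Vec(
        sentences=combined_corpus,
        vector_size=300,  # match whatever dimensionality you want/need
        min_count=5,
        window=5,
        workers=4,
        epochs=5,         # more than the 2 epochs above is usually needed
    )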

    Another approach that's sometimes used to re-use word-vectors from elsewhere is to learn a projection between your new/small model and the pretrained/larger model, based on words that are shared between the two models. Then, use that projection to move the extra words needed – from one model or the other – to new positions that render them comparable, "in the same coordinate space", to the existing vectors. There's an example of doing this in the Gensim TranslationMatrix class & demo notebook.
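
    The Gensim TranslationMatrix class wraps this idea; as a plain-numpy illustration of the underlying technique only (a least-squares linear map learned on the shared vocabulary – my own sketch, not the TranslationMatrix API), it could look like:

    import numpy as np

    # Assumed inputs: small_kv is your newly-trained model's .wv,
    # big_kv is the pretrained KeyedVectors; 'someword' is a placeholder.
    shared = [w for w in small_kv.key_to_index if w in big_kv.key_to_index]

    X = np.vstack([small_kv[w] for w in shared])  # source space
    Y = np.vstack([big_kv[w] for w in shared])    # target space

    # least-squares map W such that X @ W ≈ Y
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # project a small-model-only word into the big model's coordinate space
    vec_in_big_space = small_kv['someword'] @ W
    print(big_kv.similar_by_vector(vec_in_big_space, topn=5))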