Imagine I have a fastText model that was trained on Wikipedia articles (as explained on the official website). Would it be possible to train it again with another corpus (scientific documents) that could add new / more pertinent links between words, especially the scientific ones?
To summarize: I need the classic links that exist between all the English words, coming from Wikipedia. But I would like to enhance this model with new documents about specific sectors. Is there a way to do that? And if so, is there a way to weight the trainings so that relations coming from my custom documents would be 'more important'?
My final goal is to compute cosine similarity between documents that can be very scientific (that's why I thought adding more scientific documents would give better results).
Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known vocabulary via what's seen in new training data (through build_vocab(..., update=True)), and then for further training cycles including that new vocabulary to occur (through train()).
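For example, a minimal sketch of that update cycle, assuming gensim 4.x, a previously saved model file (the path is hypothetical), and a small tokenized corpus standing in for your scientific documents:

```python
from gensim.models import FastText

# Load the previously trained model (hypothetical path).
model = FastText.load("wiki_fasttext.model")

# new_corpus: an iterable of tokenized sentences from the domain documents.
new_corpus = [
    ["the", "spectrometer", "measures", "absorbance"],
    ["protein", "folding", "kinetics", "were", "analyzed"],
]

# Expand the known vocabulary with tokens/ngrams seen in the new data...
model.build_vocab(new_corpus, update=True)

# ...then run further training cycles over the new corpus.
model.train(new_corpus, total_examples=len(new_corpus), epochs=model.epochs)
```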
But doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
As just one example, to the extent there are tokens/ngrams in the original model that don't recur in the new data, the new training pulls the tokens/ngrams that do appear in the new data into new positions that are optimal for that data... but potentially arbitrarily far from comparable compatibility with the older tokens/ngrams.
Further, it's likely that some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
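On the weighting part of your question: there's no built-in knob in gensim for making one sub-corpus 'more important'. One crude heuristic – a blunt hack, not an established best practice, and something you'd have to validate yourself – is to oversample the domain sentences in the training stream:

```python
# Hypothetical: wiki_sentences and domain_sentences are lists of token lists.
# Repeating the domain data is a blunt way to increase its relative influence.
domain_weight = 3  # arbitrary oversampling factor; tune and test for yourself
weighted_corpus = wiki_sentences + domain_sentences * domain_weight

model.build_vocab(weighted_corpus, update=True)
model.train(weighted_corpus, total_examples=len(weighted_corpus), epochs=model.epochs)
```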
(There may be better fine-tuning strategies for other kinds of models; this is just speaking to the ability of the gensim FastText implementation to update its vocabulary and repeat training.)
But perhaps your domain of interest is scientific texts, and you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend first trying to train your own model from your own representative data, and only if you then fear you're missing important words/senses, try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to blend Wikipedia texts into your training corpus. And you should also be ready to test whether that really helps or hurts – because it could be either.)
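A from-scratch domain model might look like this minimal sketch (the parameter values are illustrative assumptions, not recommendations, and domain_corpus is a hypothetical iterable of tokenized sentences):

```python
from gensim.models import FastText

# domain_corpus: a hypothetical iterable of token lists from your own papers.
model = FastText(
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=5,      # ignore rarer tokens
    sg=1,             # skip-gram (vs. CBOW)
)
model.build_vocab(domain_corpus)
model.train(domain_corpus, total_examples=model.corpus_count, epochs=10)

# Word-level cosine similarity comes for free:
print(model.wv.similarity("spectrometer", "absorbance"))
```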
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.
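As a rough illustration of the Doc2Vec route, assuming gensim 4.x and a hypothetical list of tokenized papers:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# papers: a hypothetical list of token lists, one per document.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(papers)]

model = Doc2Vec(tagged, vector_size=200, min_count=5, epochs=20)

# Cosine similarity between two trained document vectors...
print(model.dv.similarity(0, 1))

# ...or infer a vector for an unseen paper and rank its nearest neighbours.
vec = model.infer_vector(["tokens", "of", "an", "unseen", "paper"])
print(model.dv.most_similar([vec], topn=5))
```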