I am predicting similarities of documents using the pre trained spacy word embeddings. Because I have a lot of domain specific words, I want to fine tune my vectors on a rather small data set containing my domain specific vocabulary.
My idea was to just train the spacy model again with my data. But since the word vectors in spacy are built-in, I am not sure how to do that. Is there a way to train the spacy model again with my data?
After some research, I found out, that I can train my own vectors using Gensim. There I would have to download a pre trained model for example the Google News dataset model and afterwards I could train it again with my data set. Is this the only way? Or is there a way to proceed with my spacy model?
Any help is greatly appreciated.
update: the right term here was "incremental training" and thats not possible with the pre-trained spacy models.
It is however possible, to perform incremental training on a gensim
model. I did that with the help of another pretrained vector set (i went with the fasttext
model) and then I trained this gensim
model trained with the fasttext
vectors again with my own corpus. This worked pretty well