python nlp artificial-intelligence word2vec word-embedding

Is it possible to fine-tune a pretrained word embedding model like word2vec?


I'm working on semantic matching in my search engine system. I saw that word embedding can be used for this task. However, my dataset is very limited and small, so I don't think that training a word embedding model such as word2vec from scratch will yield good results. As such, I decided to fine-tune a pre-trained model with my data.

However, I can't find a lot of information, such as articles or documentation, about fine-tuning. Some people even say that it's impossible to fine-tune a word embedding model.

This raises my question: is fine-tuning a pre-trained word embedding model possible and has anyone tried this before? Currently, I'm stuck and looking for more information. Should I try to train a word embedding model from scratch or are there other approaches?


Solution

  • As has been pointed out before, there is no "go-to" way for fine-tuning Word2Vec type models.

    I would suggest training your own model from scratch, combining your data with other available data from a similar domain. Word2vec models are fairly quick to train, and this would probably give you the best results. If you do not need static word-level embeddings, I would recommend considering contextualized embeddings, for example through sentence-transformers or a similar framework, which offers a wide selection of pre-trained models to choose from. You can fine-tune these models on your own data fairly easily, and there are plenty of resources online on how to do that.

    For your use case, you can embed all the documents into dense vector representations using the abovementioned library and then build a searchable index over this semantic space. To match a query, embed it with the same model and retrieve the documents with the highest (approximate) inner product, a task often referred to as maximum inner product search (MIPS). An example library to look at is faiss.
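
To make the retrieval step concrete, here is a minimal sketch in plain Python of what an inner product search computes. The three-dimensional vectors and the names `DOC_VECTORS`, `inner_product`, and `search` are made up for illustration; in a real system the vectors would come from the embedding model, and a library such as faiss would replace the exhaustive loop with a much faster, possibly approximate, index.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy embeddings standing in for the output of an embedding model.
# Real embeddings would typically have hundreds of dimensions.
DOC_VECTORS: Dict[str, List[float]] = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_cars": [0.0, 0.2, 0.9],
}


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the dot product of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def search(
    query_vector: Sequence[float],
    doc_vectors: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Exhaustive maximum inner product search: score every document, keep the best."""
    scored = [
        (doc_id, inner_product(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    # Highest inner product first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # The query must be embedded with the same model as the documents;
    # this vector is again a made-up stand-in.
    query = [1.0, 0.2, 0.0]
    for doc_id, score in search(query, DOC_VECTORS):
        print(doc_id, round(score, 3))
```

Scoring every document like this is linear in the size of the collection, which is fine for a small dataset; an approximate index trades a little accuracy for speed once the collection grows.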