I am using a Word2Vec model to build a vectorizer from my data.
My data comes with a custom, business-defined synonym list that I want my NLP model to respect.
For example, if "A" is a synonym of "B", then looking up synonyms of "A" with Word2Vec should return "B" as a 100% match.
I am open to trying different NLP models as well, as long as the above requirement can be met.
Since your synonym list is very small, I'd recommend training your model first, then looping through your synonym list and reassigning the word vector of each synonym in your model (and adding any synonyms that were not present in your training data). This of course destroys whatever those words' learned vectors encoded, which may or may not be a problem depending on your use case.
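A minimal sketch of that reassignment loop, using a plain dict of numpy arrays to stand in for a trained model's embedding table (the words, vectors, and synonym pairs here are made up for illustration; with gensim you'd do the same thing against `model.wv`):

```python
import numpy as np

# Toy embedding table standing in for trained Word2Vec vectors.
# "inexpensive" is a business-defined synonym of "cheap", but their
# learned vectors differ.
embeddings = {
    "cheap":       np.array([0.9, 0.1, 0.0]),
    "inexpensive": np.array([0.2, 0.8, 0.3]),
    "expensive":   np.array([-0.7, 0.2, 0.1]),
}

# Synonym list: each synonym maps to its canonical word.
synonyms = {"inexpensive": "cheap"}

# Overwrite each synonym's vector with its canonical word's vector,
# adding the synonym if it was absent from the vocabulary.
for word, canonical in synonyms.items():
    embeddings[word] = embeddings[canonical].copy()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The synonyms are now a 100% (cosine = 1.0) match.
print(cosine(embeddings["cheap"], embeddings["inexpensive"]))
```

Because the two vectors are now identical, any similarity query treats them as a perfect match, at the cost of whatever the synonym's own training had learned.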
Some alternative approaches:
- Reassign word vectors to a composite of their constituent synonyms (e.g. mean). Unfortunately this works best if your synonyms already had similar vectors, which I assume is not the case. In general this probably loses more semantic information than it preserves.
- Add synonyms as duplicate words (i.e., a given word will have two vectors, one inferred from the training data and one equal to its synonym's). This preserves semantic relationships but creates ambiguity whenever you need to perform a computation using a vector that could have a synonym. How big of an issue this is depends on your use case.
- Apply a projection to your word embeddings subject to certain constraints (synonym equality). I've never tried this so I'm unsure how it impacts your model, how difficult it would be to compute, or what objective function you'd need to optimize.
- Preprocess your text, replacing every synonym with a single canonical token before training, so the whole group learns one shared vector. Yes, it'll take a while, but compute is cheap these days.
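The preprocessing option can be sketched with a simple regex substitution; the synonym pairs below are invented for illustration, and a real pipeline would run this over every document before training:

```python
import re

# Business-defined synonyms, each mapped to a canonical token
# (made-up examples).
synonyms = {"inexpensive": "cheap", "low-cost": "cheap"}

def canonicalize(text, synonyms):
    # Build one alternation of all synonyms; longest first so that
    # overlapping entries don't shadow each other.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(synonyms, key=len, reverse=True))) + r")\b"
    )
    # Replace each synonym with its canonical word.
    return pattern.sub(lambda m: synonyms[m.group(1)], text)

print(canonicalize("a low-cost and inexpensive option", synonyms))
# -> "a cheap and cheap option"
```

After this pass, Word2Vec never sees the synonyms as distinct tokens, so they trivially share one vector; the trade-off is that you lose any ability to distinguish them downstream.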