I want to create a deep learning model that can generate context-aware synonyms. I've been thinking about using BERT, since it is bidirectional and produces good representations, and my idea was to provide the model with both the original sentence (e.g. "It is a beautiful house") and the same sentence with the word I want synonyms for masked out (e.g. "It is a [MASK] house", if I want to find synonyms for beautiful).
A normal fill-mask obviously wouldn't work, since it would not provide the model with the actual word I want to find synonyms for. I was therefore thinking about using a machine translation model (e.g. T5) instead, where you don't translate the sentence from one language to another but do an English-to-English "translation": the original sentence ("It is a beautiful house") goes into the encoder, and the masked sentence ("It is a [MASK] house") is provided as a second input - so to speak the equivalent of an almost completed translation of the original sentence. Instead of simply translating the missing word, the model would then give me the top k most probable tokens as synonyms.
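To make this more concrete, the rough sketch below (using t5-small from Hugging Face Transformers purely as an example, and feeding the decoder the part of the masked sentence up to the [MASK] position) is roughly what I'm picturing, though I haven't gotten anything sensible out of it:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder input: the original, unmasked sentence
encoder_inputs = tokenizer("It is a beautiful house", return_tensors="pt")

# Decoder input: the "almost completed translation", i.e. the masked sentence
# up to the position of the word I want synonyms for ("It is a [MASK] house")
prefix_ids = tokenizer("It is a", return_tensors="pt").input_ids[:, :-1]  # drop </s>
start = torch.tensor([[model.config.decoder_start_token_id]])
decoder_input_ids = torch.cat([start, prefix_ids], dim=1)

with torch.no_grad():
    logits = model(**encoder_inputs, decoder_input_ids=decoder_input_ids).logits

# The logits at the last decoder position are the candidates for the masked word
top_k = torch.topk(logits[0, -1], k=10)
print(tokenizer.convert_ids_to_tokens(top_k.indices.tolist()))
```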
However, I'm not sure how to make this work at all... Another approach would be to train BERT on a domain-specific corpus and then take the k-nearest neighbors of the word I want synonyms for, but from what I've read it's not possible to get word representations out of a model like BERT the same way you would from Word2Vec or GloVe, since its embeddings are contextual rather than static.
Any suggestions on how I can solve this challenge? Any help would be greatly appreciated...
If I were you, I would first push the sentence through a word sense disambiguation system like AMUSE. This will give you the WordNet synset that your word belongs to in this context, so you can read its synonyms straight off that synset.
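For the WordNet part, a minimal sketch with NLTK could look like the following; I'm using the simple Lesk algorithm here as a stand-in for a proper WSD system like AMUSE, and for adjectives I also pull in the closely related ("similar to") satellite synsets, since that is where WordNet keeps most adjective synonyms:

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

nltk.download("wordnet")

sentence = "It is a beautiful house"
target = "beautiful"

# Stand-in for a real WSD system: the classic Lesk algorithm picks a synset
# for the target word given the surrounding context ("a" = adjective).
synset = lesk(sentence.split(), target, pos="a") or wn.synsets(target, pos="a")[0]

# For adjectives, the "similar to" satellite synsets hold most of the synonyms.
related = [synset] + synset.similar_tos()
candidates = sorted({
    lemma.name().replace("_", " ")
    for s in related
    for lemma in s.lemmas()
    if lemma.name().lower() != target
})

print(synset.name(), "->", candidates)
```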
Now you have a list of candidate synonyms for the word in this context. As you planned, you can then use the MASK technique to score each candidate by the probability the model assigns it in the masked position.
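Concretely, that scoring step could look something like this with Hugging Face Transformers (bert-base-uncased and the candidate list are just placeholders here):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Mask out the word you want synonyms for
masked = f"It is a {tokenizer.mask_token} house"
inputs = tokenizer(masked, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    probs = model(**inputs).logits[0, mask_pos].softmax(dim=-1)

# Candidate synonyms, e.g. coming out of the WordNet step above
candidates = ["gorgeous", "lovely", "pretty", "picturesque"]

scores = {}
for word in candidates:
    ids = tokenizer(word, add_special_tokens=False).input_ids
    if len(ids) == 1:  # only single-token candidates fit a single [MASK]
        scores[word] = probs[ids[0]].item()

print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

The single-token check in the loop already hints at the caveat below.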
Important note: in BERT a single [MASK] is always filled by exactly one WordPiece token, which is not necessarily a full word. This biases the results against longer synonyms: any candidate that the tokenizer splits into several pieces can never be predicted in full from a single mask.
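You can check which of your candidates are affected simply by looking at how the tokenizer splits them (the exact splits of course depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["pretty", "picturesque", "easy on the eyes"]:
    pieces = tokenizer.tokenize(word)
    status = "single token" if len(pieces) == 1 else "multi-token: cannot fill a single [MASK]"
    print(f"{word!r} -> {pieces} ({status})")
```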