As of v2.1, spaCy has a BERT-style language model (LM). It predicts word-vectors instead of words, so I am going to use "words" and "word vectors" interchangeably here.
I need to take a sentence with a word masked, and a list of words, and rank the words by how likely they are to appear in the masked slot. Currently I am using BERT for this (similar to bert-syntax). I would like to see if spaCy's performance on this task is acceptable. Between this file and this one I'm pretty sure it's possible to build something. However, it feels like reaching deeper into the internals of the library than I'd like.
Is there a straightforward way to interact with spaCy's masked language model?
This is basically the disadvantage of the LMAO (language modelling with approximate outputs) approximation. I actually hadn't realised this until it was pointed out to me by someone on the /r/machinelearning subreddit.
Because we're predicting a vector, we really only get to predict one point in the vector space. This is very different from predicting a distribution over the words. Imagine we had a gap like "The __ of corn". Let's say a good distribution of fillers for that would be {kernel, ear, piece}. The vectors for these words aren't especially close, as the word2vec algorithm constructs the vector space from all contexts of the words, and the words are only interchangeable in this particular context. In the vast majority of uses of "piece", the word "ear" would be a really bad substitution.
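As a quick sanity check on that claim, here's a small sketch that compares the candidate fillers' vectors directly, using a spaCy model that ships with word vectors (en_core_web_md is just one such choice, not anything prescribed here):

```python
# Compare the pairwise similarity of the candidate fillers using a spaCy
# model that includes word vectors (en_core_web_md is one such choice).
import itertools
import spacy

nlp = spacy.load("en_core_web_md")
candidates = ["kernel", "ear", "piece"]

for w1, w2 in itertools.combinations(candidates, 2):
    tok1, tok2 = nlp(w1)[0], nlp(w2)[0]
    print(f"{w1} ~ {w2}: {tok1.similarity(tok2):.2f}")
```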
If the likely fillers aren't close together in the vector space, there's no way for the LMAO model to return an answer that corresponds to that whole set of words.
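To make that concrete, here's a rough sketch of what ranking against a single predicted point looks like. The predicted_vector argument is hypothetical: it stands in for whatever vector the pretrained tok2vec layer predicts for the masked slot (obtaining it is exactly the "reaching into the internals" step), and every candidate is just scored by cosine similarity to that one point:

```python
# Rank candidate fillers by cosine similarity to one predicted vector.
# `predicted_vector` is assumed to come from the pretrained tok2vec layer.
import numpy as np

def rank_candidates(predicted_vector, candidates, nlp):
    scored = []
    for word in candidates:
        vec = nlp.vocab[word].vector
        sim = float(
            np.dot(predicted_vector, vec)
            / (np.linalg.norm(predicted_vector) * np.linalg.norm(vec) + 1e-8)
        )
        scored.append((word, sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Whatever the sentence, the ranking depends only on where that single point falls, so a set of good fillers that's spread out across the space can't all score well at once.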
If you only need the 1-best answer, the algorithm in spacy pretrain has a good chance of giving it to you. But if you need the distribution, the approximation breaks down, and you should use something like BERT.
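For reference, the BERT route the question describes could be sketched like this, assuming the Hugging Face transformers package and bert-base-uncased (neither of which is part of spaCy); it ranks each candidate by its masked-LM probability at the gap:

```python
# Rank candidate fillers by their masked-LM probability under BERT.
# Assumes each candidate is a single wordpiece in the BERT vocabulary.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The [MASK] of corn."
candidates = ["kernel", "ear", "piece"]

inputs = tokenizer(sentence, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits[0, mask_pos].softmax(dim=-1).squeeze(0)
ranked = sorted(
    ((w, probs[tokenizer.convert_tokens_to_ids(w)].item()) for w in candidates),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # highest-probability candidate first
```

Unlike the single-point approach, this gives a genuine probability for every candidate, which is what you need when the plausible fillers aren't neighbours in vector space.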