machine-learning nlp word2vec machine-translation

Using word2vec in a sentence

I'm trying to estimate the probability that a given sentence is correct.

I have a word2vec vector for each token in the language, and I want to predict the probability that a sentence is correct. I haven't been able to come up with a suitable model. How can I proceed?

Solution

Word-vectors alone won't help you do this.

While their similarities and relative orientations are learned by predicting word co-occurrences, the vectors alone aren't a clear guide to which words co-occur. And word-vectors definitely don't encode rules of grammatical usage, because the usual training input is mere proximity within a context window, not proper ordering.
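To make the ordering point concrete, here is a minimal sketch. The vectors in `toy_vectors` are made up for illustration (they are not from any trained model), and averaging is just one common way of combining word-vectors into a sentence representation. It shows that a grammatical sentence and a scrambled copy of it end up with the same averaged vector, so nothing built on that average can tell them apart.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "the": [0.1, 0.3, 0.2],
    "dog": [0.9, 0.1, 0.4],
    "bit": [0.2, 0.8, 0.5],
    "man": [0.7, 0.2, 0.6],
}


def mean_vector(tokens):
    """Average the word vectors of a token list, component by component."""
    dims = len(next(iter(toy_vectors.values())))
    return [
        sum(toy_vectors[t][i] for t in tokens) / len(tokens)
        for i in range(dims)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


grammatical = "the dog bit the man".split()
scrambled = "man the the bit dog".split()

# Same bag of words, different order: the averaged vectors coincide
# (up to floating-point rounding), so their cosine similarity is ~1.0.
similarity = cosine(mean_vector(grammatical), mean_vector(scrambled))
print(f"cosine(grammatical, scrambled) = {similarity:.6f}")
```

Any scoring function that only looks at the set of word-vectors in a sentence has this blind spot, which is why some model of word order is needed on top of (or instead of) the vectors.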

That said, if you happen to be using the Python gensim implementation of Word2Vec, and you train a full model yourself (as opposed to using off-the-shelf pre-trained vectors), that whole model will, in some modes, support a score() method that grades a set of sentences on how well they conform to the model's expectations. In gensim, score() is only implemented for hierarchical softmax, so the model has to be trained with hs=1 and negative=0. It won't tell you whether a text is "correct", just whether it's "familiar". The method was first proposed and added as a possible way of applying multiple contrasting Word2Vec models to aid in classification problems. You can read more about this method, and find links to the research paper that motivated it and a demo usage, in the method documentation:

https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score