machine-learning nlp word2vec word-embedding

Question pairs (ground truth) datasets for Word2Vec model testing?

I'm looking for test datasets to optimize my Word2Vec model. I have found a good one from gensim:

gensim/test/test_data/questions-words.txt

Does anyone know other similar datasets?

Thank you!

Solution

It is important to note that there isn't really a "ground truth" for word-vectors. There are interesting tasks you can do with them, and some arrangements of word-vectors will be better at a specific task than others.

Also, the word-vectors that are best on one task, such as analogy-solving in the style of the questions-words.txt problems, might not be best on another important task, such as modeling texts for classification or info-retrieval.

That said, you can make your own test data in the same format as questions-words.txt. Google's original word2vec.c release, which included a tool for statistically combining nearby words into multi-word phrases, also shipped a questions-phrases.txt file in the same format. It can be used to test word-vectors that have been similarly constructed for 'words' that are actually short multi-word phrases.
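As a rough sketch of that file layout: a line starting with ":" names a section, and every other line holds four whitespace-separated words, read as "a is to b as c is to d". The section names, the example words, and the helper functions below are all made up for illustration; only the layout mirrors questions-words.txt.

```python
import os
import tempfile

# Hypothetical domain-specific analogies: section name -> list of (a, b, c, d),
# each read as "a is to b as c is to d".
ANALOGIES = {
    "drug-brand": [
        ("ibuprofen", "advil", "acetaminophen", "tylenol"),
        ("acetaminophen", "tylenol", "ibuprofen", "advil"),
    ],
    "unit-abbrev": [
        ("milligram", "mg", "milliliter", "ml"),
    ],
}


def write_analogy_file(path, sections):
    """Write sections in the questions-words.txt layout."""
    with open(path, "w", encoding="utf-8") as f:
        for name, rows in sections.items():
            f.write(f": {name}\n")             # section header line
            for row in rows:
                f.write(" ".join(row) + "\n")  # one 4-word analogy per line


def read_analogy_file(path):
    """Parse the same layout back, as a sanity check on the file."""
    sections, current = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(":"):
                current = line[1:].strip()
                sections[current] = []
                continue
            words = tuple(line.split())
            if current is None or len(words) != 4:
                raise ValueError(f"unexpected line: {line!r}")
            sections[current].append(words)
    return sections


path = os.path.join(tempfile.mkdtemp(), "my-questions.txt")
write_analogy_file(path, ANALOGIES)
parsed = read_analogy_file(path)
print(parsed)
```

A file like this can then be used wherever questions-words.txt is used. In recent gensim versions that should mean passing its path to evaluate_word_analogies() on your loaded word-vectors, though check the documentation for the version you have installed.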

The Python gensim word-vectors support includes an extra method, evaluate_word_pairs(), for checking word-vectors not on analogy-solving but on how well they conform to collections of human-determined word-similarity rankings. The documentation for that method links to an appropriate test set, SimLex-999, and you may be able to find other test sets in the same format elsewhere.
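To make the idea behind that kind of evaluation concrete, here is a minimal standard-library sketch. It is not gensim's implementation: it only illustrates comparing human similarity scores against model cosine similarities using a Spearman rank correlation. The toy vectors, the scores, and the tab-separated layout (word1, word2, score) are assumptions for illustration.

```python
import math

# Toy 2-d vectors standing in for a trained model (made up for illustration).
VECTORS = {
    "cat": (1.0, 0.1),
    "dog": (0.9, 0.2),
    "car": (0.1, 1.0),
    "truck": (0.3, 0.9),
    "banana": (0.6, 0.6),
}

# Tab-separated lines: word1, word2, human similarity score (made-up scores).
PAIRS_TSV = "cat\tdog\t9.0\ncar\ttruck\t8.5\ncat\tbanana\t3.0\ndog\tcar\t1.0\n"


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(xs), ranks(ys))


def evaluate_pairs(tsv_text, vectors):
    human, model = [], []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        w1, w2, score = line.split("\t")
        if w1 not in vectors or w2 not in vectors:
            continue  # skip pairs with out-of-vocabulary words
        human.append(float(score))
        model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model)


rho = evaluate_pairs(PAIRS_TSV, VECTORS)
print(f"Spearman correlation: {rho:.3f}")
```

A correlation near 1 means the model orders the pairs much as the human raters did. With gensim itself, you would skip all of this and pass the path of such a file to evaluate_word_pairs().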

But, again, none of these should be considered the absolute test of word-vectors' overall quality. The best test for your particular project's use of word-vectors would be some repeatable, domain-specific evaluation score that you devise yourself and that is inherently correlated with your end goals.
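One possible shape for such a score, purely as an illustration: write down domain judgments of the form "this anchor word should be closer to X than to Y", then report the fraction of judgments each candidate model satisfies. The triplets, the two toy models, and the triplet_score() helper are hypothetical; a real project would plug in its own vectors and judgments, or a different score entirely.

```python
import math

# Hypothetical domain judgments: (anchor, should_be_closer, should_be_farther).
TRIPLETS = [
    ("invoice", "receipt", "keyboard"),
    ("invoice", "payment", "monitor"),
    ("keyboard", "monitor", "receipt"),
]

# Two toy "models" to compare (made-up 2-d vectors).
MODEL_A = {
    "invoice": (1.0, 0.0), "receipt": (0.9, 0.1), "payment": (0.8, 0.3),
    "keyboard": (0.0, 1.0), "monitor": (0.1, 0.9),
}
MODEL_B = {
    "invoice": (1.0, 0.0), "receipt": (-0.2, 1.0), "payment": (0.8, 0.3),
    "keyboard": (0.0, 1.0), "monitor": (0.1, 0.9),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_score(vectors, triplets):
    """Fraction of triplets where the anchor is nearer 'closer' than 'farther'."""
    hits = 0
    for anchor, closer, farther in triplets:
        near = cosine(vectors[anchor], vectors[closer])
        far = cosine(vectors[anchor], vectors[farther])
        if near > far:
            hits += 1
    return hits / len(triplets)


# Same fixed test set for every candidate, so the comparison is repeatable.
scores = {
    name: triplet_score(vecs, TRIPLETS)
    for name, vecs in [("A", MODEL_A), ("B", MODEL_B)]
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Because the test set is fixed, the same score can be recomputed after every change to training parameters, which is what makes it usable for tuning.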