I'm looking for test datasets to optimize my Word2Vec model. I have found a good one from gensim:
gensim/test/test_data/questions-words.txt
Does anyone know other similar datasets?
Thank you!
It is important to note that there isn't really a "ground truth" for word-vectors. There are interesting tasks you can do with them, and some arrangements of word-vectors will be better on specific tasks than others.
But also, the word-vectors that are best on one task (such as analogy-solving in the style of the questions-words.txt problems) might not be best on another important task, such as modeling texts for classification or info-retrieval.
That said, you can make your own test data in the same format as questions-words.txt. Google's original word2vec.c release, which also included a tool for statistically combining nearby words into multi-word phrases, also included a questions-phrases.txt file in the same format. That file can be used to test word-vectors that have been similarly constructed for 'words' that are actually short multiple-word phrases.
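As a rough sketch of what "the same format" means: questions-words.txt is plain text in which a line starting with ": " names a section, and every other line holds four space-separated words read as "a is to b as c is to d". The helper name write_analogy_file and the example analogies below are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_analogy_file(path, sections):
    """Write analogies in the questions-words.txt layout.

    `sections` maps a section name to a list of 4-word tuples (a, b, c, d),
    read as "a is to b as c is to d". Hypothetical helper, not part of gensim.
    """
    lines = []
    for name, quads in sections.items():
        lines.append(f": {name}")  # section header line
        for quad in quads:
            if len(quad) != 4:
                raise ValueError(f"expected 4 words, got {quad!r}")
            lines.append(" ".join(quad))  # one analogy per line
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Made-up, domain-specific analogies; replace with ones from your own domain.
my_sections = {
    "capital-country": [("Paris", "France", "Rome", "Italy")],
    "tool-language": [("pip", "Python", "npm", "JavaScript")],
}

out_path = Path(tempfile.mkdtemp()) / "my-analogies.txt"
analogy_lines = write_analogy_file(out_path, my_sections)
print("\n".join(analogy_lines))
```

In gensim, a file like this can be scored with KeyedVectors.evaluate_word_analogies(), for example model.wv.evaluate_word_analogies(str(out_path)), assuming model is a trained Word2Vec model whose vocabulary contains these words.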
The Python gensim word-vectors support includes an extra method, evaluate_word_pairs(), for checking word-vectors not on analogy-solving but on conformance to collections of human-determined word-similarity rankings. The documentation for that method includes a link to an appropriate test set, SimLex-999, and you may be able to find other test sets of the same format elsewhere.
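To make the idea behind this kind of evaluation concrete, here is a minimal pure-Python sketch: compute the cosine similarity of each word pair from the vectors, then measure how well the model's ranking agrees with the human ranking via Spearman correlation. This is not gensim's implementation; the toy vectors and human scores are invented, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented 2-d "word-vectors" and invented human similarity scores.
vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
human_pairs = [
    ("cat", "dog", 8.5),
    ("car", "truck", 8.0),
    ("cat", "car", 1.5),
    ("dog", "truck", 2.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in human_pairs]
human_scores = [score for _, _, score in human_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

With gensim itself you would not write this by hand: model.wv.evaluate_word_pairs(path) reads the pairs from a delimited text file and reports correlation statistics for you.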
But, again, none of these should be considered the absolute test of word-vectors' overall quality. The best test, for your particular project's use of word-vectors, would be some repeatable domain-specific evaluation score you devise yourself, that's inherently correlated to your end goals.
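As one possible shape for such a score, the sketch below assumes the end goal is text classification: it averages word-vectors into document vectors, builds one centroid per label, and reports accuracy on held-out texts. The data, the vectors, and the function names are all invented for illustration; the point is only that the same function can be re-run on each candidate set of word-vectors to compare them.

```python
import math


def doc_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid_accuracy(vectors, train, test):
    """Nearest-centroid classification accuracy on `test`.

    `train` and `test` are lists of (tokens, label) pairs. Test documents
    with no known tokens count as misclassified.
    """
    by_label = {}
    for tokens, label in train:
        dv = doc_vector(tokens, vectors)
        if dv is not None:
            by_label.setdefault(label, []).append(dv)
    centroids = {
        label: [sum(column) / len(dvs) for column in zip(*dvs)]
        for label, dvs in by_label.items()
    }
    correct = 0
    for tokens, label in test:
        dv = doc_vector(tokens, vectors)
        if dv is None:
            continue
        predicted = max(centroids, key=lambda lab: cosine(dv, centroids[lab]))
        correct += predicted == label
    return correct / len(test)


# Invented toy vectors and labeled texts standing in for a real project.
toy_vectors = {
    "goal": [1.0, 0.0], "match": [0.9, 0.2], "team": [0.8, 0.1],
    "stock": [0.0, 1.0], "market": [0.2, 0.9], "price": [0.1, 0.8],
}
train_docs = [(["goal", "match"], "sports"), (["stock", "market"], "finance")]
test_docs = [(["team", "match"], "sports"), (["price", "market"], "finance")]

accuracy = centroid_accuracy(toy_vectors, train_docs, test_docs)
print(f"held-out accuracy: {accuracy:.2f}")
```

To compare models, keep train_docs and test_docs fixed and swap in the vectors from each training run, so that differences in the score come from the word-vectors alone.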