The question is two-fold:
1. How to select the ideal value for size
?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
{kfhahf}
Lfhslnf;
.
.
.
Row 2
(stdgff ksshu, hsihf)
asgasf;
.
.
.
Etc.
Based on this post: Python: What is the "size" parameter in Gensim Word2vec model class The size
parameter should be less than (or equal to?) the vocabulary size. So, I am trying to dynamically assign the size as following:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# I do Word2Vec for each row
For item in dataset:
Tokenized = word_tokenize(item)
model = Word2Vec([Tokenized], min_count=1)
I get the vocabulary size here. So I create a second model:
model1 = Word2Vec([Tokenized], min_count=1, size=len(model.wv.vocab))
This sets the size
value to the current vocab value of the current row, as I intended. But is it the right way to do? What is the right size for a small vocabulary text?
There's no simple formula for the best size
- it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size
(and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size
values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
Regarding your code-as-shown:
there's unlikely any point to computing a new model
for every item
in your dataset (discarding the previous model
on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item)))
. Any Word2Vec
model of interest would likely need to be trained on the combined corpus of tokens from all items.
it's usually the case that min_count=1
makes a model worse than larger values (like the default of min_count=5
). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)