I am using word2vec (gensim 4.3.3) for word embedding. Inspecting the word vectors from the saved file 'wv.vectors.npy' shows that all of them are very small: the min of the entire array is -0.003 and the max is 0.003, so each word is embedded with a vector of tiny values, which is not expected. What could the problem be: is my corpus or vocabulary not a good fit for word2vec, or is something wrong with my training settings?
I am working with mol2vec (https://github.com/samoturk/mol2vec), which embeds molecules into vectors using word2vec. I am trying to retrain the model with my own list of molecules. The "words" are not real words but ID numbers: hashed identifiers generated from Morgan fingerprints that represent sub-structures of a molecule, and all the words of one molecule together form a sentence. The corpus looks like:
2246997334 3696389118 2246699815 2259502203 977461771 2245384272 1506993418 UNK 2245273601 1736287034 387666683 864662311 1542633699 2245277810 954800030 3006711714 864674487 1979311206 264864308 2246699815 3537119515 2246728737 3537119515
864942730 10565946 3217380708 328936174 3237386214 2132511834 2297887526 808456108 3218693969 584893129 864662311 2192318254
20 million sentences, with 0.3 million unique words
I trained with mostly default settings adopted from the mol2vec source (the original code uses word2vec from an older version of gensim; I changed some code to adapt it to the newer gensim, which should not affect the performance?):
from gensim.models import word2vec

corpus = word2vec.LineSentence('smiles.cp.unk')
model = word2vec.Word2Vec(corpus, vector_size=300, window=10, min_count=4, workers=-1, sg=1)
The pretrained model provided by the mol2vec source has a vectors array with values from roughly -2 to 2. I have tried different window sizes and vector sizes, but all give similar results: tiny vector values from -0.003 to 0.003.
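For reference, this is roughly how the range can be checked, by loading the saved array directly with numpy (a minimal check; the same values appear in model.wv.vectors in memory):

import numpy as np

vecs = np.load('wv.vectors.npy')  # the array gensim saved alongside the model
print(vecs.min(), vecs.max())     # prints roughly -0.003 and 0.003 in my runs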
workers=-1 is not a valid value for the workers parameter of Gensim's Word2Vec model class. If you're mimicking some online example suggesting that value, it's a bad reference.
If you carefully review the console output of your run, especially if you've enabled logging at the WARNING level or lower, there will probably be a message about that. At the very least you might notice that your training completes almost instantly (because no worker threads are started), which would not be expected with a corpus of tens of millions of words.
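For example, a minimal logging setup before training (standard Python logging; INFO also shows training progress, while WARNING shows only problems):

import logging

logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)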
workers should be a positive number that's no larger than the number of CPU cores available.
If your system has 4 cores, 3-4 might be good values for workers. (The value that actually achieves the fastest training depends a bit on your other parameters and corpus.)
If your system has 8 cores, 4-8 might be good values.
If your system has 16 or more cores, 8-12 are usually good values. Above that count of workers, and certainly above 16, Python and Gensim contention issues can cause extra worker threads to actually hurt training throughput when using the usual approach of a Python iterator corpus, even if you have many more cores.
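As a sketch of what that might look like with the question's own corpus file and parameters (the cap of 8 is just one reasonable choice following the guidance above, not a gensim requirement):

import os
from gensim.models import word2vec

corpus = word2vec.LineSentence('smiles.cp.unk')
cores = os.cpu_count() or 1          # detected CPU core count
model = word2vec.Word2Vec(
    corpus,
    vector_size=300,
    window=10,
    min_count=4,
    sg=1,
    workers=min(8, cores),           # positive, and no larger than the core count
)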