I'm using the "distance" tool to find similar words in a word2vec model that I built. The model contains around 1.6M words and was trained with this command:
./word2vec -train processed-text-2016.txt -output vec-cbow-neg.txt -debug 2 -threads 5 -size 300 -window 10 -sample 1e-3 -negative 10 -hs 0 -binary 0 -cbow 1 > w2v-neg.log &
My problem is that when I type any word, I get the following:
Enter word or sentence (EXIT to break): rt
Word: rt Position in vocabulary: 658253
-0.000451 0.494857
356414 0.477918
9 0.441466
83 0.432876
63 0.431347
-0.020525 0.429472
.047345 0.425791
36 0.423420
242 0.418320
... ...
Enter word or sentence (EXIT to break): nd
Word: nd Position in vocabulary: 336527
3 0.494377
489 0.492153
632 0.483827
0.002335 0.462591
0693 0.458801
036869 0.452456
036819 0.447690
31 0.443887
... ...
Enter word or sentence (EXIT to break): and
Word: and Position in vocabulary: 1600843
080852 0.451752
57 0.438413
16577 0.437900
4 0.433538
.005464 0.429279
003131 0.422587
17380 0.420614
9 0.419624
5082 0.419569
0.019322 0.417945
.000435 0.417265
115991 0.414139
... ...
Enter word or sentence (EXIT to break): happy
Word: happy Position in vocabulary: -1
Out of dictionary word!
Enter word or sentence (EXIT to break): man
Word: man Position in vocabulary: 470143
0.055039 0.488181
4793 0.455608
90743 0.454786
060493 0.453180
36 0.451387
6 0.450261
4 0.445118
830 0.442580
490 0.439919
0.025327 0.437766
0.005571 0.436606
0.001964 0.436544
-0.012627 0.434358
... ...
Enter word or sentence (EXIT to break): women
Word: women Position in vocabulary: -1
Out of dictionary word!
Enter word or sentence (EXIT to break): queen
Word: queen Position in vocabulary: -1
If I grep for these words in the model file (a text file), I find them, so I'm not sure why this is happening or how to fix it. Is it caused by noise in the data (I'm debugging this) or by the parameters I used?
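The answer is simply that I was using the text format of the model, not the binary format. The distance tool reads the vectors as raw binary floats, so a model trained with -binary 0 is misread: fragments of the numbers in the text file show up as "words", and most real words are reported as out of dictionary. Retraining with -binary 1, or converting the existing text file to the binary format, fixes it.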
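
If retraining is too expensive, the existing text model can probably be converted instead. Below is a minimal sketch in Python, not part of the word2vec tools. It assumes the usual layout written by word2vec: a header line with the vocabulary size and vector size, then one word per line followed by its values, and in the binary file each word followed by a space, the vector as 4-byte floats, and a newline. It also assumes a little-endian machine. The function name text_to_binary is made up for this example.

```python
import struct


def text_to_binary(text_path, binary_path):
    """Convert a word2vec text-format model to the binary layout.

    Assumes little-endian 4-byte floats, which matches what the C tool
    writes on typical x86 machines.
    """
    with open(text_path, "rb") as src, open(binary_path, "wb") as dst:
        # Header line: "<vocab_size> <vector_size>"
        header = src.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        dst.write(b"%d %d\n" % (vocab_size, dim))

        count = 0
        for line in src:
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if len(parts) != dim + 1:
                raise ValueError("unexpected number of fields for %r" % parts[0])
            word = parts[0]
            values = [float(p.decode("ascii")) for p in parts[1:]]
            # Binary record: word, one space, raw floats, newline
            dst.write(word + b" ")
            dst.write(struct.pack("<%df" % dim, *values))
            dst.write(b"\n")
            count += 1

        if count != vocab_size:
            raise ValueError("header says %d words, found %d" % (vocab_size, count))
    return vocab_size, dim
```

For the model in the question this would be called as text_to_binary("vec-cbow-neg.txt", "vec-cbow-neg.bin"), where the .bin name is just an example, and the distance tool would then be pointed at the .bin file. The file is handled as bytes so that words in any encoding pass through unchanged.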