I'd really appreciate any hint on the following problem, because all I want is to obtain an embedding from this dataset. I will describe my whole solution, because (hopefully) the problem lies in just some part that I didn't consider.
I'm working with an annotated corpus in which words in a given sentence are disambiguated via WordNet synset IDs, which I will call tags. For example:
<sentence>
<text>word1 word2 word3</text>
<annotations>
<annotation anchor="word1" lemma="lemma1">tag1</annotation>
<annotation anchor="word2" lemma="lemma2">tag2</annotation>
<annotation anchor="word3" lemma="lemma3">tag3</annotation>
</annotations>
</sentence>
Starting from this, given an embedding dimension that I will call n, I would like to build an embedding like this:
lemma1_tag1 dim 1 dim 2 dim 3 ... dim n
lemma2_tag2 dim 1 dim 2 dim 3 ... dim n
lemma3_tag3 dim 1 dim 2 dim 3 ... dim n
My plan was to generate a corpus for Word2Vec from the text of each sentence, replacing each anchor with the corresponding lemma1_tag1 token (some tokens contain more than one underscore, because I replaced spaces in multi-word lemmas with underscores). Since not every word is annotated, after a simple preprocessing step to remove stopwords and punctuation, I end up with something like the following:
let just list most_recent_01730444a headline_06344461n
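For illustration, here is a minimal sketch of that replacement step (it assumes exactly the XML structure shown above; the file paths and the function name are placeholders, and stopword/punctuation removal is omitted):

import xml.etree.ElementTree as ET

def sentence_to_tokens(sentence):
    # Map each annotated anchor to its lemma_tag token.
    replacements = {}
    for ann in sentence.find('annotations').findall('annotation'):
        lemma = ann.get('lemma').replace(' ', '_')
        replacements[ann.get('anchor')] = lemma + '_' + ann.text
    # Replace annotated words, keep the rest as-is.
    return [replacements.get(w, w) for w in sentence.find('text').text.split()]

tree = ET.parse('corpus.xml')  # placeholder path
with open('data/test.txt', 'w') as out:
    for sentence in tree.iter('sentence'):
        out.write(' '.join(sentence_to_tokens(sentence)) + '\n')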
Since I'm only interested in annotated words, I also generated a predefined vocabulary to use as the Word2Vec vocabulary. This file contains one entry per row, like:
lemma1_tag1
lemma2_tag2
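This is roughly how I generate that file (a sketch; note that I'm not certain word2vec accepts a bare word-per-line list for -read-vocab, since the files written by its own -save-vocab option pair each word with a count, so the sketch writes counts as well):

from collections import Counter

counts = Counter()
with open('data/test.txt') as corpus:
    for line in corpus:
        counts.update(line.split())

with open('data/dictionary.txt', 'w') as out:
    for word, count in counts.items():
        if '_' in word:  # heuristic: only annotated lemma_tag tokens contain underscores
            out.write('%s %d\n' % (word, count))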
So, after having defined a corpus and a vocabulary, I used them with the Word2Vec toolkit:
./word2vec -train data/test.txt -output data/embeddings.vec -size 300 -window 7 -sample 1e-3 -hs 1 -negative 0 -iter 10 -min-count 1 -read-vocab data/dictionary.txt -cbow 1
Starting training using file data/test.txt
Vocab size: 80
Words in train file: 20811
The problem is that the corpus contains 32,000,000+ words and the predefined vocabulary file has about 80,000 entries, yet the training run above reports a vocabulary of only 80 and roughly 20,000 training words. I even tried Gensim in Python, but (of course) I got the very same output. I suspect the problem is that Word2Vec doesn't accept words in the lemma1_tag1 format because of the underscore, but I don't know how to solve this. Any hint is appreciated; thank you in advance!
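For reference, this is roughly the Gensim equivalent I tried (a sketch mirroring the flags above; parameter names follow Gensim 4.x, where size and iter became vector_size and epochs):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# CBOW, 300 dimensions, window 7, hierarchical softmax,
# no negative sampling, 10 epochs, min count 1.
model = Word2Vec(
    LineSentence('data/test.txt'),
    vector_size=300, window=7, sample=1e-3,
    hs=1, negative=0, epochs=10, min_count=1, sg=0,
)
model.wv.save_word2vec_format('data/embeddings.vec')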
Both the original word2vec.c from Google and gensim's Word2Vec handle words with underscores just fine.
If both are reading your input file, and both report just 80 unique words where you're expecting 80,000-plus, there's probably something wrong with your input file.
What does wc data/test.txt report?
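If wc's word count doesn't match what you expect, a quick Python cross-check of total and unique tokens (a sketch) can narrow it down:

total, unique = 0, set()
with open('data/test.txt') as f:
    for line in f:
        words = line.split()
        total += len(words)
        unique.update(words)
print(total, 'tokens,', len(unique), 'unique')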