Tags: python, gensim, word2vec, word-embedding

Use Word2Vec to build a sense embedding


I'd appreciate any hint on the following problem: all I want is to obtain a sense embedding from the dataset described below. I'll write out my whole solution because (hopefully) the problem is just in some part I didn't consider.

I'm working with an annotated corpus, in which words in a given sentence are disambiguated with WordNet synset IDs, which I'll call tags. For example:

Dataset

<sentence>
  <text>word1 word2 word3</text>
  <annotations>
    <annotation anchor="word1" lemma="lemma1">tag1</annotation>
    <annotation anchor="word2" lemma="lemma2">tag2</annotation>
    <annotation anchor="word3" lemma="lemma3">tag3</annotation>
  </annotations>
</sentence>
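For reference, a minimal sketch of how one sentence in this format could be turned into the lemma_tag tokens described below (the XML string here is a hypothetical stand-in for the real dataset):

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal sentence in the dataset's format.
xml = """<sentence>
  <text>word1 word2 word3</text>
  <annotations>
    <annotation anchor="word1" lemma="lemma1">tag1</annotation>
    <annotation anchor="word2" lemma="lemma2">tag2</annotation>
    <annotation anchor="word3" lemma="lemma3">tag3</annotation>
  </annotations>
</sentence>"""

sentence = ET.fromstring(xml)
text = sentence.find("text").text

# Map each anchor to its lemma_tag token
# (spaces in lemmas become underscores).
replacements = {
    a.get("anchor"): a.get("lemma").replace(" ", "_") + "_" + a.text
    for a in sentence.iter("annotation")
}

# Non-annotated words pass through unchanged.
tokens = [replacements.get(w, w) for w in text.split()]
print(" ".join(tokens))  # lemma1_tag1 lemma2_tag2 lemma3_tag3
```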

Starting from this, given an embedding dimension that I'll call n, I would like to build an embedding like this:

Embedding

lemma1_tag1 dim 1 dim 2 dim 3 ... dim n
lemma2_tag2 dim 1 dim 2 dim 3 ... dim n
lemma3_tag3 dim 1 dim 2 dim 3 ... dim n

I thought of generating a corpus for Word2Vec from the text of each sentence, replacing each anchor with its corresponding lemma1_tag1 token (some tokens contain more than one underscore, because I replaced spaces in lemmas with underscores). Since not every word is annotated, after some simple preprocessing to remove stopwords and punctuation I end up with something like the following example:

Corpus Example

let just list most_recent_01730444a headline_06344461n

Since I'm only interested in the annotated words, I also generated a predefined vocabulary to use as the Word2Vec vocabulary. Each row of this file contains an entry like:

Vocabulary Example

lemma1_tag1
lemma2_tag2

So, having defined a corpus and a vocabulary, I used them with the word2vec toolkit:

Terminal emulation

./word2vec -train data/test.txt -output data/embeddings.vec -size 300 -window 7 -sample 1e-3 -hs 1 -negative 0 -iter 10 -min-count 1 -read-vocab data/dictionary.txt -cbow 1

Output

Starting training using file data/test.txt
Vocab size: 80
Words in train file: 20811

The problem is that the corpus contains more than 32,000,000 words and the predefined vocabulary file contains about 80,000 entries, yet word2vec reports a vocabulary of only 80. I also tried Gensim in Python and (of course) got the very same output. I think the problem is that Word2Vec doesn't consider words in the lemma1_tag1 format because of the underscore, and I don't know how to solve this. Any hint is appreciated; thank you in advance!


Solution

  • Both the original word2vec.c from Google and gensim's Word2Vec handle words with underscores just fine.

    If both are looking at your input file, and both report just 80 unique words where you're expecting 80,000-plus, there's probably something wrong with your input file.

    What does wc data/test.txt report?
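    As a cross-check alongside wc, a small helper (the path is hypothetical) that reproduces word2vec's "Words in train file" and "Vocab size" counts directly, so you can see whether the file word2vec reads really contains what you think it does:

    ```python
    from collections import Counter

    def corpus_stats(path):
        """Return (lines, total words, unique words) for a whitespace-tokenized corpus."""
        counts = Counter()
        lines = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                lines += 1
                counts.update(line.split())
        return lines, sum(counts.values()), len(counts)

    # e.g. lines, total, unique = corpus_stats("data/test.txt")
    # `total` should match "Words in train file", `unique` the "Vocab size".
    ```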