Tags: deep-learning, gensim, word2vec, one-hot-encoding, word-embedding

Word2Vec - How can I store and retrieve extra information for each instance of the corpus?


I need to combine Word2Vec with my CNN model. To that end, I need to persist a flag (a binary one is enough) for each sentence, because my corpus contains two types (i.e., target classes) of sentences, and I need to retrieve each sentence's flag after the vectors are created. How can I store and retrieve this information alongside the input sentences of Word2Vec, given that I need both in order to train my deep neural network?

p.s. I'm using Gensim implementation of Word2Vec.

p.s. My corpus has 6,925 sentences, and Word2Vec produces 5,260 vectors.

Edit: More detail regarding my corpus (as requested):

The structure of the corpus is as follows:

  1. sentences (label: positive) -- A Python list

    • Feature-A: String
    • Feature-B: String
    • Feature-C: String
  2. sentences (label: negative) -- A Python list

    • Feature-A: String
    • Feature-B: String
    • Feature-C: String

Then all the sentences were given as the input to Word2Vec.

from gensim.models import Word2Vec

word2vec = Word2Vec(all_sentences, min_count=1)

I'll feed my CNN with the extracted features (in this case, the learned vectors for the vocabulary) and the target classes of the sentences. So I need the sentences' labels as well.


Solution

  • Because the Word2Vec model doesn't retain any representation of the individual training texts, this is entirely a matter for you in your own Python code.

    That doesn't seem like very much data. (It's rather tiny for typical Word2Vec purposes to have just a 5,260-word final vocabulary.)

    Unless each text (aka 'sentence') is very long, you could even just use a Python dict where each key is the full string of a sentence, and the value is your flag.

    But if, as is likely, your source data has some other unique identifier per text – like a unique database key, or even a line/row number in the canonical representation – you should use that identifier as a key instead.

    In fact, if there's a canonical source ordering of your 6,925 texts, you could just have a list flags with 6,925 elements, in order, where each element is your flag. When you need to know the status of a text from position n, you just look at flags[n].

    (To make more specific suggestions, you'd need to add more details about the original source of the data, and exactly when/why you'd need to be checking this extra property later.)
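The two bookkeeping options described above can be sketched as follows. The sentence strings and flag values are made up for illustration; the only assumption is that you keep the flags in the same canonical order as the corpus.

```python
# Hypothetical corpus in its canonical source order
all_sentences = ["the movie was great", "terrible plot", "loved the acting"]

# Option 1: a parallel list of flags, one per text, in corpus order.
# The flag for the text at position n is simply flags[n].
flags = [1, 0, 1]  # 1 = positive, 0 = negative
print(flags[1])

# Option 2: a dict keyed by the full sentence string, workable
# as long as each text is short enough and unique.
flag_by_sentence = {s: f for s, f in zip(all_sentences, flags)}
print(flag_by_sentence["terrible plot"])
```

If your source data carries a unique identifier per text (a database key, a row number), use that identifier as the dict key instead of the full string.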