I need to combine Word2Vec with my CNN model. To this end, I need to persist a flag (a binary one is enough) for each sentence, as my corpus has two types (a.k.a. target classes) of sentences. So, I need to retrieve this flag for each vector after creation. How can I store and retrieve this information inside the input sentences of Word2Vec, as I need both of them in order to train my deep neural network?
p.s. I'm using the Gensim implementation of Word2Vec.
p.s. My corpus has 6,925 sentences, and Word2Vec produces 5,260 vectors.
Edit: More detail regarding my corpus (as requested):
The structure of the corpus is as follows:
sentences (label: positive) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
sentences (label: negative) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
Then all the sentences were given as the input to Word2Vec.
word2vec = Word2Vec(all_sentences, min_count=1)
I'll feed my CNN with the extracted features (which is the vocabulary in this case) and the targets of the sentences. So, I need these labels of the sentences as well.
Because the Word2Vec model doesn't retain any representation of the individual training texts, this is entirely a matter for you in your own Python code.
That doesn't seem like very much data. (It's rather tiny for typical Word2Vec purposes to have just a 5,260-word final vocabulary.)
Unless each text (aka 'sentence') is very long, you could even just use a Python dict where each key is the full string of a sentence, and the value is your flag.
But if, as is likely, your source data has some other unique identifier per text – like a unique database key, or even a line/row number in the canonical representation – you should use that identifier as a key instead.
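As a rough sketch of the dict approach, assuming each text carries some unique identifier: the IDs, tokens, and names below (records, flag_by_id, flag_for) are all made up for illustration and aren't part of Gensim or of your data.

```python
# Hypothetical toy records: (unique_id, tokens, flag) triples.
# In practice the IDs would come from your own source data
# (a database key, a row number, etc.).
records = [
    ("doc-001", ["good", "service"], 1),
    ("doc-002", ["slow", "delivery"], 0),
    ("doc-003", ["great", "price"], 1),
]

# Map each identifier to its flag, and separately to its tokens.
flag_by_id = {text_id: flag for text_id, _tokens, flag in records}
tokens_by_id = {text_id: tokens for text_id, tokens, _flag in records}


def flag_for(text_id):
    """Return the stored flag for a text, given its identifier."""
    return flag_by_id[text_id]


looked_up = flag_for("doc-002")
```

Keying by identifier rather than by the full sentence string also avoids collisions if the same sentence ever appears under both labels.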
In fact, if there's a canonical source ordering of your 6,925 texts, you could just have a list flags with 6,925 elements, in order, where each element is your flag. When you need to know the status of a text from position n, you just look at flags[n].
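A minimal sketch of that parallel-list idea, assuming the positive and negative sentences start out as two separate Python lists as described in the question. The toy sentences and the names positive_sentences and negative_sentences are placeholders, and 1/0 is just one possible encoding of the flag.

```python
# Placeholder tokenized sentences standing in for the real corpus.
positive_sentences = [["feature", "a"], ["feature", "b"]]
negative_sentences = [["feature", "c"]]

# Concatenate in a fixed order, and build the flags in that same order.
all_sentences = positive_sentences + negative_sentences
flags = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# The two lists must stay aligned: flags[n] describes all_sentences[n].
assert len(flags) == len(all_sentences)

# all_sentences is what would then be handed to Word2Vec, as in the
# question: Word2Vec(all_sentences, min_count=1)

n = 2
status = flags[n]  # flag of the text at position n
```

This only works as long as nothing reorders or filters one list without the other, so any shuffling before training should be applied to both together.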
(To make more specific suggestions, you'd need to add more details about the original source of the data, and exactly when/why you'd need to be checking this extra property later.)