Tags: python, tokenize, gensim, word2vec, word-embedding

How to combine 3D token embeddings into 2D vectors?


I have a set of strings that I am tokenizing. I am sending each string into the word2vec model in gensim. Say there are 100 tokens (e.g. 'I', 'ate', 'pizza', etc.); it is generating a 100 * 100 3D matrix (a list of lists in Python). How is it possible to convert the generated 3D token embeddings into a 2D vector?

I am sending this 3D output into a model built with the TensorFlow library, and I am doing the following:

model.add(Embedding(max_features, 128, input_length=maxlen))

Here max_features is the number of tokens, i.e. 100, and input_length is set to the same value.

But I am not sure if this is getting the job done. Is this the right way to convert 3D token embeddings into 2D vectors? Ideally, I want to convert the embeddings into 2D vectors before sending them into the model.
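
To make the setup concrete, here is a simplified sketch of what I am doing (toy data for illustration; vector_size is the gensim 4.x parameter name, which older releases call size):

    from gensim.models import Word2Vec
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding

    # toy stand-in for my tokenized strings (each string -> list of tokens)
    tokenized_strings = [
        ['I', 'ate', 'pizza'],
        ['I', 'like', 'pizza'],
    ]

    # send the token lists into word2vec
    w2v_model = Word2Vec(tokenized_strings, vector_size=100, min_count=1)

    max_features = 100   # number of distinct tokens in my real data
    maxlen = 100         # I am currently passing the same value as input_length

    model = Sequential()
    # note: input_length is accepted by tf.keras through TF 2.15 (Keras 2);
    # newer Keras 3 releases have dropped this argument
    model.add(Embedding(max_features, 128, input_length=maxlen))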


Solution

  • The final results of training aren't really "3D" in the usual Word2Vec/gensim terminology. If you've used Word2Vec with its default vector_size=100, and you had 100 vocabulary words, then you'd have 100 vectors of 100 dimensions each.

    (Note: you would never want to create such high-dimensional "dense embedding" vectors for such a tiny vocabulary. The essential benefits of such dense representations come from forcing a much-larger set of entities into many-fewer dimensions, so that they are "compressed" into subtle, continuous, meaningful relative positions against each other. Giving 100 words a full 100 continuous dimensions going into Word2Vec training leaves the model prone to severe overfitting. It could in fact then trend towards a "one-hot"-like encoding of each word, and become very good at the training task without really learning to pack related words near each other in a shared space – which is the usually-desired result of training. In my experience, for 100-dimensional vectors, you probably want at least 100^2 (i.e. 10,000) vocabulary words. If you really just care about 100 words, then you'd want to use much-smaller vectors – but also remember that Word2Vec & related techniques are really meant for "large data" problems, with many subtly-varied training examples, and only sometimes give meaningful results on toy-sized data.)

    The 100 vectors of 100 dimensions each are internally stored inside the Word2Vec model (& related components) as a raw numpy ndarray, which can be thought of as a "2D array" or "2D matrix". (It's not really a list of lists unless you convert it to that less-optimal form – though of course with Pythonic polymorphism you can generally pretend it is a list of lists.) If your gensim Word2Vec model is in w2v_model, then the raw numpy array of learned vectors is in the w2v_model.wv.vectors property, though the interpretation of which row corresponds to which word-token depends on the w2v_model.wv.vocab dictionary entries – see the sketch at the end of this answer.

    As far as I can tell, the TensorFlow Embedding class is for training your own embeddings inside TF (though perhaps it can be initialized with vectors trained elsewhere – a sketch of one way to do that follows below). Its 1st initialization argument should be the size of the vocabulary (100 in your conjectured case); its 2nd is the size of the desired embeddings (also 100 in your conjectured case – but as noted above, this match of vocab-size and dense-embedding-size is inappropriate, and the example values in the TF docs of 1000 words and 64 dimensions would be more appropriately balanced).
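
    If you do want to pull the learned vectors out of gensim as a plain 2D numpy array, and optionally seed a TF/Keras Embedding layer with them, a rough sketch follows. (This assumes the gensim 4.x attribute names key_to_index/index_to_key and the Keras embeddings_initializer parameter; in gensim 3.x the equivalent mapping lives in wv.vocab/wv.index2word. The tiny training call is only there so the snippet runs on its own.)

        from gensim.models import Word2Vec
        from tensorflow.keras.initializers import Constant
        from tensorflow.keras.layers import Embedding

        # toy model just so this snippet is self-contained; in practice use
        # whatever Word2Vec model you've already trained
        w2v_model = Word2Vec([['I', 'ate', 'pizza'], ['I', 'like', 'pizza']],
                             vector_size=100, min_count=1)

        vectors = w2v_model.wv.vectors          # 2D numpy array: (vocab_size, vector_size)
        vocab_size, vector_size = vectors.shape

        # row i of `vectors` holds the vector for token w2v_model.wv.index_to_key[i];
        # the reverse lookup goes through the key_to_index dict
        pizza_row = w2v_model.wv.key_to_index['pizza']

        # seed a TF/Keras Embedding layer with the gensim-trained weights,
        # instead of letting TF learn fresh embeddings from scratch
        embedding_layer = Embedding(
            input_dim=vocab_size,               # 1st argument: vocabulary size
            output_dim=vector_size,             # 2nd argument: embedding dimensionality
            embeddings_initializer=Constant(vectors),
            trainable=False,                    # set True to fine-tune during TF training
        )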