How to extract matrix together with vocab from gensim word2vec model


I've trained a word2vec model like so:

import multiprocessing

from gensim.models import Word2Vec

# cores is assumed to be the machine's CPU count
cores = multiprocessing.cpu_count()

# create model without initializing
model = Word2Vec(min_count=20,
                 window=5,
                 sample=6e-5,
                 negative=20,
                 workers=cores-1,
                 vector_size=300)

# build vocabulary (sentences is an iterable of tokenized sentences)
model.build_vocab(sentences, progress_per=10000)

# train model
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)

I'd like to export the model as a dataframe, but I'm not sure how to extract the matrix and the vocab together so that each row is paired with the right label.

Something like this:

label V1 V2 V...
government 0.560774564 -0.0464625023 ...
state 0.0106112240 0.0464625023 ...
.... ... .. .

I've tried this:

tmp = pd.DataFrame(model.syn1neg)
tmp.insert(0, 'label', model.wv.index_to_key)

but the rows do not line up when I compare:

>>> model.wv.get_index('government')
10
>>> tmp.loc[[0]]
0 government 0.329972  0.160003 -0.516633  ...  0.460873 -0.170273 -1.621128  1.255289

Solution

  • For anyone else looking for a solution to this with gensim 4.x.x, here's what I wound up doing:

    import numpy as np
    import pandas as pd

    vocab, vectors = model.wv.key_to_index, model.wv.vectors

    # pair each label with its row index in the vectors matrix
    label_index = np.array([(label, idx) for label, idx in vocab.items()])

    # init dataframe from the embedding vectors and use the labels as its index
    tmp = pd.DataFrame(vectors[label_index[:, 1].astype(int)])
    tmp.index = label_index[:, 0]
    tmp.to_csv("matrix_with_labels.csv")

    I'm not sure this is the best or most proper way, but it works.
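
A possibly simpler variant, offered as a minimal sketch: in gensim 4.x, row i of `model.wv.vectors` is the vector for `model.wv.index_to_key[i]`, so the labels can be passed straight to pandas as the index. Note also that `model.syn1neg`, used in the question's attempt, holds the output-layer weights used for negative sampling, so it is not the matrix of word vectors. The helper name `keyed_vectors_to_frame`, the `V1..Vn` column names, and the `fake_wv` stand-in (there only so the snippet runs without a trained model) are assumptions for illustration.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def keyed_vectors_to_frame(kv):
    """Return a DataFrame with one row per key, indexed by the key.

    `kv` is expected to look like gensim 4.x `model.wv`: row i of
    `kv.vectors` is the vector for `kv.index_to_key[i]`.
    """
    columns = [f"V{i + 1}" for i in range(kv.vectors.shape[1])]
    return pd.DataFrame(kv.vectors, index=kv.index_to_key, columns=columns)


# Hypothetical stand-in for `model.wv`, so the sketch runs without training.
fake_wv = SimpleNamespace(
    index_to_key=["government", "state", "law"],
    vectors=np.arange(12, dtype=np.float32).reshape(3, 4),
)

df = keyed_vectors_to_frame(fake_wv)
df.index.name = "label"
print(df)
```

With a trained model, the same helper would be called as `keyed_vectors_to_frame(model.wv).to_csv("matrix_with_labels.csv")`.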