I've trained a word2vec model like so:
import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()
# create model without initializing
model = Word2Vec(min_count=20,
                 window=5,
                 sample=6e-5,
                 negative=20,
                 workers=cores-1,
                 vector_size=300)
# build vocabulary
model.build_vocab(sentences, progress_per=10000)
# train model
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)
I'd like to export the model as a dataframe, but I'm not sure how to extract the embedding matrix and the vocabulary together so that the index positions line up.
Something like this:
label | V1 | V2 | V... |
---|---|---|---|
government | 0.560774564 | -0.0464625023 | ... |
state | 0.0106112240 | 0.0464625023 | ... |
.... | ... | .. | . |
I've tried this:
tmp = pd.DataFrame(model.syn1neg)
tmp.insert(0, 'label', model.wv.index_to_key)
but the rows don't line up with the vocabulary, as this comparison shows:
>>> model.wv.get_index('government')
10
>>> tmp.loc[[0]]
0 government 0.329972 0.160003 -0.516633 ... 0.460873 -0.170273 -1.621128 1.255289
For anyone else looking for a solution to this with gensim 4.x.x: the attempt above fails because `model.syn1neg` holds the output-layer weights used for negative sampling, not the word embeddings; the vectors whose rows correspond to `model.wv.index_to_key` are `model.wv.vectors`. Here's what I wound up doing:
import numpy as np
import pandas as pd

vocab, vectors = model.wv.key_to_index, model.wv.vectors
# pair each label with its row index in the vector matrix
label_index = np.array([(key, idx) for key, idx in vocab.items()])
# build the dataframe from the embedding vectors, reordered by index
tmp = pd.DataFrame(vectors[label_index[:, 1].astype(int)])
# use the labels as the row index
tmp.index = label_index[:, 0]
tmp.to_csv("matrix_with_labels.csv")
Not sure this is the best or proper way, but it works.
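Since gensim 4.x keeps the rows of `model.wv.vectors` in the same order as `model.wv.index_to_key`, the explicit reindexing above may be unnecessary. A shorter equivalent sketch, using small stand-in arrays in place of a trained model (the question's corpus isn't reproducible here):

```python
import numpy as np
import pandas as pd

# Stand-ins for model.wv.index_to_key and model.wv.vectors:
# in gensim 4.x the rows of wv.vectors are already ordered to
# match wv.index_to_key, so no reindexing should be needed.
index_to_key = ["government", "state", "law"]
vectors = np.array([[0.56, -0.05],
                    [0.01, 0.05],
                    [-0.33, 0.16]])

# With a real model this would be:
#   tmp = pd.DataFrame(model.wv.vectors, index=model.wv.index_to_key)
tmp = pd.DataFrame(vectors, index=index_to_key)

# Each word's dataframe row is now its embedding row.
print(tmp.loc["government"].tolist())
```

This relies on the gensim 4 guarantee that `wv.vectors` and `wv.index_to_key` share the same ordering; if that ever changed, the explicit `key_to_index` lookup in the answer above would still be correct.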