Export gensim doc2vec embeddings to a separate file for later use with a Keras Embedding layer


I am fairly new to gensim, and I am trying to solve a problem that involves using doc2vec embeddings in Keras. I wasn't able to find an existing implementation of doc2vec in Keras; as far as I can see, every example I have found so far just uses gensim to get the document embeddings.

Once I have trained my doc2vec model in gensim, I need to export the embedding weights from gensim into Keras somehow, and it is not clear how to do that. I see that

model.syn0

supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for document embeddings. Any advice?

I know that in general I can just get the embedding for each document directly from the gensim model, but I want to fine-tune the embedding layer in Keras later on: the doc embeddings will be used as part of a larger task, so they might need to be fine-tuned a bit.


Solution

  • I figured this out.

    Assuming you already trained the gensim model and used string tags as document ids:

    #get vector of doc
    model.docvecs['2017-06-24AEON']
    #raw docvectors (all of them)
    model.docvecs.doctag_syn0
    #docvector names in model
    model.docvecs.offset2doctag
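
The attribute names above follow the older gensim API used in this answer. As far as I know, gensim 4 exposes the same data as model.dv, model.dv.vectors and model.dv.index_to_key, but treat that as an assumption and check it against your installed version. Below is a minimal sketch of a helper that reads whichever set of names is present. The helper name doc_vectors_and_tags is hypothetical, and the stand-in objects only mimic the attribute layout so that the snippet runs without a trained model:

```python
from types import SimpleNamespace

import numpy as np


def doc_vectors_and_tags(model):
    """Return (matrix of doc vectors, list of doc tags) from a Doc2Vec-like model.

    Hypothetical helper: tries the (assumed) gensim 4 names first, then the
    older names used in this answer.
    """
    dv = getattr(model, "dv", None)
    if dv is not None:
        # assumed gensim 4 layout: model.dv.vectors / model.dv.index_to_key
        return np.asarray(dv.vectors), list(dv.index_to_key)
    # older layout: model.docvecs.doctag_syn0 / model.docvecs.offset2doctag
    docvecs = model.docvecs
    return np.asarray(docvecs.doctag_syn0), list(docvecs.offset2doctag)


# Stand-ins that only mimic the attribute layout (not real gensim models).
vectors = np.arange(6, dtype=float).reshape(3, 2)
tags = ["2017-06-24AEON", "doc-b", "doc-c"]
old_style = SimpleNamespace(
    docvecs=SimpleNamespace(doctag_syn0=vectors, offset2doctag=tags)
)
new_style = SimpleNamespace(dv=SimpleNamespace(vectors=vectors, index_to_key=tags))

old_matrix, old_tags = doc_vectors_and_tags(old_style)
new_matrix, new_tags = doc_vectors_and_tags(new_style)
```

Row k of the returned matrix belongs to the tag at position k of the returned list, which is the pairing needed to build the embedding matrix.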
    

    You can export these doc vectors into a Keras embedding layer as shown below, assuming your DataFrame df contains all of the documents. Note that the embedding layer accepts only integers as inputs; I use the row number in the DataFrame as the id of the doc for input. Also note that index 0 of the embedding layer should be left untouched, since it is conventionally reserved for masking, so when I pass the doc id as input to my network I need to ensure it is > 0.

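    import numpy as np
    from keras.layers import Input, Embedding

    # create the embedding matrix; row 0 stays all zeros (reserved for masking)
    embedding_matrix = np.zeros((len(df) + 1, text_encode_dim))
    # enumerate gives positional row numbers even if df.index is not 0..n-1
    for i, doc_code in enumerate(df['docCode']):
        # modelDoc2Vec is the trained gensim Doc2Vec model
        embedding_matrix[i + 1] = modelDoc2Vec.docvecs[doc_code]

    # input with the id of the document (int32, since int16 overflows above 32767 docs)
    doc_input = Input(shape=(1,), dtype='int32', name='doc_input')
    # embedding layer initialized with the matrix created earlier;
    # set trainable=True if the doc vectors should be fine-tuned
    embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df) + 1, weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)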
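
To see why the ids are shifted by one, here is a minimal NumPy-only sketch of the same construction. The dictionary doc_vectors is a made-up stand-in for the modelDoc2Vec.docvecs lookups, and the function name build_embedding_matrix is hypothetical. The final indexing step only imitates what a frozen embedding layer does with integer ids; it is not Keras code:

```python
import numpy as np


def build_embedding_matrix(doc_codes, doc_vectors, dim):
    """Stack doc vectors into a matrix whose row 0 is left as zeros.

    Returns the matrix and a dict mapping each doc code to its integer id (>= 1).
    """
    matrix = np.zeros((len(doc_codes) + 1, dim))
    code_to_id = {}
    for position, code in enumerate(doc_codes):
        doc_id = position + 1  # shift by one so that id 0 is never used
        matrix[doc_id] = doc_vectors[code]
        code_to_id[code] = doc_id
    return matrix, code_to_id


# Made-up vectors standing in for modelDoc2Vec.docvecs[...] lookups.
doc_vectors = {
    "2017-06-24AEON": np.array([0.1, 0.2, 0.3]),
    "doc-b": np.array([0.4, 0.5, 0.6]),
}
doc_codes = list(doc_vectors)

embedding_matrix, code_to_id = build_embedding_matrix(doc_codes, doc_vectors, dim=3)

# A frozen embedding lookup amounts to row indexing into the matrix.
ids = np.array([code_to_id["doc-b"], code_to_id["2017-06-24AEON"]])
looked_up = embedding_matrix[ids]
```

The values of code_to_id are what would be fed to doc_input; none of them is 0, so the all-zero row is never selected by a real document.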
    

    UPDATE

    After late 2017, with the introduction of the Keras 2.0 API, the very last line should be changed to pass the matrix through embeddings_initializer instead of weights:

    from keras.initializers import Constant
    embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1, embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)