
How to use vectors from Doc2Vec in Tensorflow


I am trying to use Doc2Vec to convert sentences to vectors, then use those vectors to train a TensorFlow classifier.

I am a little confused about what tags are used for, and how to extract all of the document vectors from Doc2Vec after it has finished training.

My code so far is as follows:

import gensim
import pandas as pd
from gensim.models.doc2vec import TaggedDocument

fake_data = pd.read_csv('./sentences/fake.txt', sep='\n')
real_data = pd.read_csv('./sentences/real.txt', sep='\n')
sentences = []

for i, row in fake_data.iterrows():
    sentences.append(TaggedDocument(row['title'].lower().split(), ['fake', len(sentences)]))

for i, row in real_data.iterrows():
    sentences.append(TaggedDocument(row['title'].lower().split(), ['real', len(sentences)]))

model = gensim.models.Doc2Vec(sentences)

I get vectors when I call print(model.docvecs[1]) and so on, but they are different every time I rebuild the model.

First of all: have I used Doc2Vec correctly? Second: is there a way I can grab all documents tagged 'real' or 'fake', then turn them into a NumPy array and pass it into TensorFlow?


Solution

  • I believe the tags you attach to each TaggedDocument do not do what you expect. The Doc2Vec algorithm learns one vector per tag, and a tag may be shared by several documents. With the tags in your code, 'fake' and 'real' each get a single vector trained on every document of that class, in addition to the per-document vectors for the integer tags. So if your goal is simply to convert sentences to vectors, the recommended tag is a unique sentence identifier, such as the sentence index.

    The learned document vectors are then stored in model.docvecs (renamed model.dv in gensim 4.0 and later). For example, if you use the sentence index as the tag, you can look up the first document's vector in model.docvecs under the tag "0", the second under the tag "1", and so on.

    Example code:

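    from gensim.models import doc2vec

    # `sentences` is assumed to be a list of token lists, one per document.
    # Each document gets its own unique tag: 'real-0', 'real-1', ...
    documents = [doc2vec.TaggedDocument(sentence, ['real-%d' % i])
                 for i, sentence in enumerate(sentences)]
    model = doc2vec.Doc2Vec(documents, vector_size=10)  # 10 is just for illustration

    # Raw vectors are stored in `model.docvecs.vectors_docs`.
    # It's easier to access each one by its tag; the tags are stored in `model.docvecs.doctags`.
    # (gensim 4.0+ renames these to `model.dv.vectors` and `model.dv.index_to_key`.)
    for tag in model.docvecs.doctags.keys():
        print(tag, model.docvecs[tag])  # Prints the learned numpy array for this tag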
    

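    By the way, to control the model's randomness, use the seed parameter of the Doc2Vec class. For a fully reproducible run, gensim's documentation also says to train with a single worker thread (workers=1) and, on Python 3, to fix the PYTHONHASHSEED environment variable.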
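
For the second part of the question (collecting the 'real' and 'fake' vectors into a NumPy array), one possible approach is sketched below. It is only an illustration under stated assumptions: tags are assumed to follow a '<label>-<index>' pattern such as 'real-12', the helper name split_by_label is hypothetical, and a plain dict stands in for the trained model so the snippet runs without gensim.

```python
import numpy as np


def split_by_label(tag_to_vector, labels=("fake", "real")):
    """Stack vectors into a feature matrix X and an integer label array y.

    Assumes each tag is a string of the form '<label>-<index>', e.g. 'real-12'.
    The integer class of a row is the position of its label in `labels`.
    """
    rows, targets = [], []
    for tag in sorted(tag_to_vector):  # sorted only to make the row order stable
        label = tag.rsplit("-", 1)[0]
        if label not in labels:
            continue  # skip tags that do not follow the assumed scheme
        rows.append(np.asarray(tag_to_vector[tag], dtype=np.float32))
        targets.append(labels.index(label))
    return np.vstack(rows), np.array(targets, dtype=np.int64)


# Stand-in for the trained model: a plain dict of tag -> vector (made-up values).
demo_vectors = {
    "fake-0": [0.1, 0.2, 0.3],
    "fake-1": [0.0, 0.5, 0.1],
    "real-0": [0.9, 0.8, 0.7],
}

X, y = split_by_label(demo_vectors)
print(X.shape, y)
```

With the gensim 3.x API used in the answer, the dict could presumably be built as {tag: model.docvecs[tag] for tag in model.docvecs.doctags}. X and y are then ordinary NumPy arrays, which TensorFlow accepts directly, for example as the inputs and targets of a Keras model's fit method.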