Tags: python, parsing, doc2vec

getting vector-tag pairs after training in doc2vec


I am trying to convert a bunch of poems into vectors and then run my own implementation of k-means on them, but I can't figure out how to get the vectors with their tags attached after training in doc2vec. I also find that when I train on 11 files I get 14 vectors back out (I would obviously like one vector out per poem in).

My code takes in a path to a folder with a few text files in it. Right now I am just printing the vectors given by learner.docvecs, and have no idea which vector is which.

Code:

import os
import string   # needed for string.punctuation below
import gensim

def parse_doc2vec(direc):

    # gets list of file names
    files = os.listdir(direc)

    translator = str.maketrans("", "", string.punctuation)

    tovecpoems=[]

    count = 0
    for filename in files:

        with open(os.path.join(direc, filename)) as file:
            lines = file.read().split(sep="\n")  # 'list' shadowed the builtin; renamed
        subject = lines[0].split(" ", 1)[0]

        poem = lines[3:]
        poem = ''.join(poem)
        poem = poem.split()
        for i in range(len(poem)):
            poem[i] = poem[i].replace('\t', '').replace('\n', '')
            poem[i] = poem[i].translate(translator)
        # Filter empties afterwards: calling remove() while indexing by a
        # fixed range skips elements and can raise IndexError.
        poem = [word for word in poem if word != '']


        tovecpoem = gensim.models.doc2vec.LabeledSentence(words = poem, tags = [filename,subject])
        tovecpoems.append(tovecpoem)
        count += 1


    learner = gensim.models.doc2vec.Doc2Vec(tovecpoems,alpha=0.025, min_alpha=0.025)

    for epoch in range(10):
        learner.train(tovecpoems,total_examples = learner.corpus_count, epochs = learner.iter)
        learner.alpha -= 0.002
        learner.min_alpha = learner.alpha


    vectors = learner.docvecs

    for vec in vectors:
        print(vec,'\n')

If someone could please tell me how to retrieve vectors with their filenames attached from tags, and why vectors has more objects in it than tovecpoems does, I would be grateful.


Solution

  • You should show what your code is printing, so answerers know what looks wrong to you.

    You're supplying Doc2Vec with 11 text examples, but every example you're providing has 2 tags: tags=[filename, subject]. Doc2Vec learns one vector per unique tag, not one per document, so the 3 extra doc-vectors, beyond one per filename, are probably for unique values of subject that repeat across files.
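    To see how repeated subject tags inflate the count, here's a small arithmetic sketch (the filenames and subjects are made up; the asker's real tags aren't shown):

```python
# Hypothetical tags: 11 files, each tagged with its filename plus a subject.
# Subjects repeat across files, so they add fewer unique tags than filenames.
filenames = ["poem%d.txt" % i for i in range(11)]
subjects = (["love", "war", "nature"] * 4)[:11]  # only 3 distinct subjects

all_tags = set()
for fname, subj in zip(filenames, subjects):
    all_tags.update([fname, subj])

# Doc2Vec learns one vector per unique tag, not one per document:
print(len(all_tags))  # 11 unique filenames + 3 unique subjects = 14
```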

    Separately, it's a bad idea to try to manage alpha/min_alpha yourself, or to call train() multiple times in your own loop. Just leave those values at their defaults, and use Doc2Vec's iter parameter to specify how many training passes you want.

    And by providing tovecpoems as an argument to Doc2Vec, you've already triggered training - there's no need to call train() at all. (If you'd not supplied the corpus already, it would make sense to call build_vocab() and train() exactly once each, but no more.)

    So for example your code would make more sense as just:

    learner = gensim.models.doc2vec.Doc2Vec(tovecpoems, iter=10)
    vectors = learner.docvecs
    # ...etc
    

    Note that you won't get good Doc2Vec/Word2Vec results from tiny toy-sized datasets. These algorithms generally need many thousands of examples (containing hundreds of thousands to millions of words) to achieve the sort of vector quality people usually want from them. (You can sometimes squeeze a little demo-able value out of a small dataset by (1) shrinking the model vector size radically, and (2) increasing the number of iter training passes, but that's hit-or-miss.)