I am trying to convert a bunch of poems into vectors and then run my own implementation of k-means on them, but I can't figure out how to retrieve the vectors with their tags attached after training in doc2vec. I also find that when I train on 11 files I get 14 vectors back out (I would obviously like the same number of vectors out as files in).
My code takes a path to a folder containing a few text files. Right now I am just printing the vectors given by `learner.docvecs`, and have no idea which vector is which.
Code:
import os
import string

import gensim

def parse_doc2vec(direc):
    # gets list of file names
    files = os.listdir(direc)
    translator = str.maketrans("", "", string.punctuation)
    tovecpoems = []
    for filename in files:
        with open(direc + "/" + filename) as file:
            lines = file.read().split(sep="\n")
        subject = lines[0].split(" ", 1)[0]
        poem = ''.join(lines[3:]).split()
        words = []
        for word in poem:
            word = word.replace('\t', '').replace('\n', '').translate(translator)
            if word != '':
                words.append(word)
        tovecpoem = gensim.models.doc2vec.LabeledSentence(words=words, tags=[filename, subject])
        tovecpoems.append(tovecpoem)
    learner = gensim.models.doc2vec.Doc2Vec(tovecpoems, alpha=0.025, min_alpha=0.025)
    for epoch in range(10):
        learner.train(tovecpoems, total_examples=learner.corpus_count, epochs=learner.iter)
        learner.alpha -= 0.002
        learner.min_alpha = learner.alpha
    vectors = learner.docvecs
    for vec in vectors:
        print(vec, '\n')
If someone could please tell me how to retrieve the vectors with the filename attached from `tags`, and why `vectors` has more objects in it than `tovecpoems` does, I would be grateful.
You should show what your code is printing, so answerers can see exactly what looks wrong to you.
You're supplying `Doc2Vec` with 11 text examples, but every text example you're providing has 2 tags: `tags=[filename, subject]`. So the 3 extra tags, beyond your 11 filenames, are probably unique values of `subject` that repeat across files.
Separately, it's a bad idea to try to manage `alpha`/`min_alpha` yourself, or to call `train()` multiple times in your own loop. Just leave those values at their defaults, and use `Doc2Vec`'s `iter` parameter to specify how many training passes you want.
And by providing `tovecpoems` as an argument to `Doc2Vec`, you've already triggered training, so there's no need to call `train()` at all. (If you hadn't supplied the corpus already, it would make sense to call `build_vocab()` and `train()` exactly once each, but no more.)
So for example your code would make more sense as just:
learner = gensim.models.doc2vec.Doc2Vec(tovecpoems, iter=10)
vectors = learner.docvecs
# ...etc
Note that you won't get good `Doc2Vec`/`Word2Vec` results from tiny, toy-sized datasets. They generally need many thousands of examples (containing hundreds of thousands to millions of words) to achieve the vector quality people usually want. (You can sometimes squeeze a little demo-able value out of a small dataset by (1) shrinking the model's vector `size` radically, and (2) increasing the number of `iter` training passes, but that's hit-or-miss.)