I am trying to get started with word2vec and doc2vec using the excellent tutorials, here and here, and trying to use the code samples. I only added a clean_line() method to remove punctuation, stopwords, etc.

But I am having trouble with the clean_line() method called in the training iterations. I understand that the call to the global method is messing it up, but I am not sure how to get past this problem.
Iteration 1
Traceback (most recent call last):
File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 96, in <module>
train()
File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 91, in train
model.train(sentences.sentences_perm(),total_examples=model.corpus_count,epochs=model.iter)
File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 61, in sentences_perm
shuffled = list(self.sentences)
AttributeError: 'TaggedLineSentence' object has no attribute 'sentences'
My code is below:
import gensim
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
import os
import random
import numpy
from sklearn.linear_model import LogisticRegression
import logging
import sys
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))


def clean_line(line):
    new_str = unicode(line, errors='replace').lower()  # encoding issues
    dlist = tokenizer.tokenize(new_str)
    dlist = list(set(dlist).difference(stopword_set))
    new_line = ' '.join(dlist)
    return new_line


class TaggedLineSentence(object):
    def __init__(self, sources):
        self.sources = sources

        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled


def train():
    # create a list data that stores the content of all text files in order of their names in docLabels
    doc_files = [f for f in os.listdir('./data/') if f.endswith('.csv')]

    sources = {}
    for doc in doc_files:
        doc2 = os.path.join('./data', doc)
        sources[doc2] = doc.replace('.csv', '')

    sentences = TaggedLineSentence(sources)

    # iterator returned over all documents
    model = gensim.models.Doc2Vec(size=300, min_count=2, alpha=0.025, min_alpha=0.025)
    model.build_vocab(sentences)

    # training of model
    for epoch in range(10):
        # random.shuffle(sentences)
        print 'iteration ' + str(epoch + 1)
        # model.train(it)
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        model.train(sentences.sentences_perm(), total_examples=model.corpus_count, epochs=model.iter)

    # saving the created model
    model.save('reddit.doc2vec')
    print "model saved"


train()
Those aren't great tutorials for the latest versions of gensim. In particular, it's a bad idea to be calling train() multiple times in a loop with your own manual management of alpha/min_alpha. It's easy to mess up (the wrong things will happen in your code, for example) and offers no benefit for most users. Don't change min_alpha from the default, and call train() exactly once: it will then do exactly epochs iterations, decaying the learning-rate alpha from its max to min values properly.
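As a rough sketch of that pattern (parameter names vary between gensim versions: older releases use size and iter, newer ones use vector_size and epochs), the whole alpha-decrementing loop in your train() can collapse to a single call:

model = gensim.models.Doc2Vec(size=300, min_count=2)  # leave alpha/min_alpha at their defaults
model.build_vocab(sentences)
# one call: gensim runs model.iter passes and decays alpha from max to min internally
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)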
Your specific error is because your TaggedLineSentence class doesn't have a sentences property (at least not until after to_array() is called), and yet the code is trying to access that non-existent property.
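If you only wanted to silence that error while keeping the class exactly as written, the immediate (if not ideal) fix would be to populate that attribute before the training loop, for example:

sentences = TaggedLineSentence(sources)
sentences.to_array()  # fills self.sentences, so sentences_perm() has something to shuffle
model.build_vocab(sentences)
# ...the epoch loop would then run without the AttributeError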
The whole to_array()/sentences_perm() approach is a bit broken. The reason for using such an iterable class is typically to keep a large dataset out of main memory, streaming it from disk. But to_array() then just loads everything, caching it inside the class, which eliminates the benefit of the iterable. If you can afford that, because the full dataset easily fits in memory, you can just do...
sentences = list(TaggedLineSentence(sources))
...to iterate-from-disk once, then keep the corpus in an in-memory list.
And shuffling repeatedly during training isn't usually needed. Only if the training data has some existing clumping – like all the examples with certain words/topics are stuck together at the top or bottom of the ordering – is the native ordering likely to cause training problems. And in that case, a single shuffle, before any training, should be enough to remove the clumping. So again assuming your data fits in memory, you can just do...
sentences = list(TaggedLineSentence(sources))
random.shuffle(sentences)  # shuffles in place; random.shuffle() returns None, so don't assign its result
...once, then you've got a sentences that's fine to pass to Doc2Vec in both build_vocab() and train() (called once), as sketched below.
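Putting those pieces together, a simplified train() along these lines might look like the following (a sketch only, carrying over your file layout and hyperparameters):

def train():
    doc_files = [f for f in os.listdir('./data/') if f.endswith('.csv')]
    sources = {os.path.join('./data', doc): doc.replace('.csv', '') for doc in doc_files}

    # stream from disk once, shuffle once, keep the corpus in memory
    sentences = list(TaggedLineSentence(sources))
    random.shuffle(sentences)

    model = gensim.models.Doc2Vec(size=300, min_count=2)
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.save('reddit.doc2vec')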