While training using doc2vec, I got this error:
AttributeError: 'list' object has no attribute 'words' in python gensim module
This is my code:
# Extracting titles from csv to list
with open(query+'_titles.csv', 'rb') as f:
    reader = csv.reader(f)
    titlelist = list(reader)

# build
model = doc2vec.Doc2Vec(size=30, window=1, alpha=0.01, min_count=2, sample=1e-5, workers=100)
model.build_vocab(titlelist)
titlearray = np.asarray(titlelist)
print 'Training Model...'
I use Python 2.7.11 and gensim 3.2.0, if that helps. There must be something I am really missing.
Doc2Vec requires not just the list of sentences, but the list of tagged sentences. From this discussion on DS.SE:
In word2vec there is no need to label the words, because every word has its own semantic meaning in the vocabulary. But in the case of doc2vec, there is a need to specify how many words or sentences convey a semantic meaning, so that the algorithm can identify them as a single entity. For this reason, we specify labels or tags for a sentence or paragraph, depending on the level of semantic meaning conveyed.
Consequently, Gensim expects the following input:
sentences = [doc2vec.TaggedDocument(sentence, ['tag']) for sentence in titlelist]
model.build_vocab(sentences)
Obviously, you might want to set different tags depending on the sentences to get meaningful vectors. By the way, are you sure you want to read CSV in binary mode?
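To make the tagging concrete, here is a minimal sketch of giving each title its own tag. It uses a namedtuple stand-in with the same two fields (words, tags) as gensim's TaggedDocument so that it runs without gensim installed; with gensim you would import the real class instead. The sample rows, the tag_titles helper, and the TITLE_<n> tag format are assumptions made up for illustration, not part of the original code.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, which is a namedtuple
# with the fields `words` and `tags`. With gensim installed, use instead:
#     from gensim.models.doc2vec import TaggedDocument
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# Hypothetical rows, shaped like the output of list(csv.reader(f)):
# every row is a list of cell strings, here with one title per row.
titlelist = [['Deep learning for text'], ['Text mining basics']]


def tag_titles(rows):
    """Tokenize the first cell of each row and tag it by its row index."""
    tagged = []
    for i, row in enumerate(rows):
        if not row:
            continue  # skip blank CSV lines
        words = row[0].lower().split()  # naive whitespace tokenization
        tagged.append(TaggedDocument(words=words, tags=['TITLE_%d' % i]))
    return tagged


sentences = tag_titles(titlelist)
```

A plain list has no words attribute, which is what the AttributeError is complaining about; each item in sentences does. Passing sentences to model.build_vocab in place of titlelist should then get past that error, assuming the rest of the setup is unchanged.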