Tags: python, machine-learning, nlp, gensim, doc2vec

AttributeError: 'list' object has no attribute 'words' in python gensim module


While training using doc2vec, I got this error:

AttributeError: 'list' object has no attribute 'words'

This is my code:

# Extracting titles from csv to list
with open(query+'_titles.csv', 'rb') as f:
    reader = csv.reader(f)
    titlelist = list(reader)
# build
model = doc2vec.Doc2Vec(size=30, window=1, alpha=0.01, min_count=2, sample=1e-5, workers=100)
model.build_vocab(titlelist)
titlearray = np.asarray(titlelist)
print 'Training Model...'

I use Python 2.7.11 with gensim 3.2.0, if that helps. I must be missing something.


Solution

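  • Doc2Vec requires not just a list of sentences but a list of tagged sentences. Each item must expose a words attribute and a tags attribute; a plain list has neither, hence the AttributeError. From this discussion on DS.SE: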

    In word2vec there is no need to label the words, because every word has their own semantic meaning in the vocabulary. But in case of doc2vec, there is a need to specify that how many number of words or sentences convey a semantic meaning, so that the algorithm could identify it as a single entity. For this reason, we are specifying labels or tags to sentence or paragraph depending on the level of semantic meaning conveyed.

    Consequently, Gensim expects the following input:

    # tags must be a list; a bare string would be treated as one tag per character
    sentences = [doc2vec.TaggedDocument(sentence, ['tag']) for sentence in titlelist]
    model.build_vocab(sentences)
    

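    You will usually want a distinct tag per sentence (for example its row index) so that each title gets its own meaningful vector. As for reading the CSV in binary mode: that is what the csv module expects on Python 2, but on Python 3 the file should be opened in text mode with newline=''.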
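
The tagging step described in the answer can be sketched as follows. This is only a minimal illustration under some assumptions: the helper name tag_titles, the in-memory sample data, and the choice of the row index as the tag are all hypothetical, and the sketch assumes each title sits in the first CSV column and is split on whitespace. If gensim is not installed, a namedtuple with the same two fields (words and tags) stands in for TaggedDocument so the snippet still runs.

```python
import csv
import io
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tag_titles(rows):
    """Turn csv.reader rows into TaggedDocument objects tagged by row index."""
    documents = []
    for index, row in enumerate(rows):
        if not row:
            continue  # skip blank lines in the CSV
        # Assumption: the title is in the first column; tokenize on whitespace.
        words = row[0].lower().split()
        # tags is a list, here holding a single unique integer per title.
        documents.append(TaggedDocument(words=words, tags=[index]))
    return documents


# Hypothetical sample standing in for query + '_titles.csv'.
sample_csv = io.StringIO("Deep learning for text\nA survey of word embeddings\n")
documents = tag_titles(csv.reader(sample_csv))

print(documents[0].words)  # ['deep', 'learning', 'for', 'text']
print(documents[1].tags)   # [1]
```

The resulting documents list is what would be passed to model.build_vocab(documents) in place of the raw titlelist. Note that csv.reader yields one list of fields per row, not one list of words, so some tokenization step like the split above is likely needed before tagging.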