
Python error "'numpy.ndarray' object has no attribute 'words'" when training Doc2Vec


When I trained my Doc2Vec model, I passed over the dataset multiple times and shuffled the training reviews each time to improve accuracy. Python then raised AttributeError: 'numpy.ndarray' object has no attribute 'words'. Here is my Python code:

def labelizeReviews(reviews, label_type):
  labelized = []
  for index, review in enumerate(reviews):
      label = ' %s_%s ' % (label_type, index)
      labelized.append(LabeledSentence(review, [label]))
  return labelized

x_train = labelizeReviews(x_train, 'TRAIN')  # input x_train is a list of word lists, each word list is a list of tokens of all words in one document
x_train=np.array(x_train)
model_dm = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.0001, iter=10, min_count=5, window=10, size=size, sample=1e-3,
                                 negative=5, workers=3)
for epoch in range(10):
    perm = np.random.permutation(x_train.shape[0])
    model_dm.train(x_train[perm], total_examples=model_dbow.corpus_count, epochs=model_dbow.iter)

This is the error message:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Users\123\Anaconda2\lib\threading.py", line 801, in __bootstrap_inner
    self.run()
  File "C:\Users\123\Anaconda2\lib\threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Users\123\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\word2vec.py", line 857, in job_producer
    sentence_length = self._raw_word_count([sentence])
  File "C:\Users\123\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\doc2vec.py", line 729, in _raw_word_count
    return sum(len(sentence.words) for sentence in job)
  File "C:\Users\123\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\doc2vec.py", line 729, in <genexpr>
    return sum(len(sentence.words) for sentence in job)
AttributeError: 'numpy.ndarray' object has no attribute 'words'

Does anyone know how to solve this problem? Thanks a lot!!!


Solution

  • Pick a good demo/tutorial to use as your guide – first running it to see proper operation, then adjusting it to use your data or parameters instead.

    For example, there's a Doc2Vec introduction Jupyter notebook included with gensim, doc2vec-lee.ipynb. You can find it inside your installed gensim directory, in the docs/notebooks subdirectory, or view it online at:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

    Now, that demo is on an unrealistically tiny toy dataset - just 300 short documents of a few hundred words each. Doc2Vec typically won't give good results on such a tiny dataset. But this demo uses an atypically small size (50 dimensions) and an atypically large iter (55) to eke out some usefulness.

    (With more typical training sets of tens of thousands to millions of documents, you could use a more typical size of 100 or more dimensions, and a more typical iter of just 10-20.)

    But if you build on a good, working example like this, you won't make certain mistakes. For example:

    • You'll use the current recommended example class, TaggedDocument, not its older variant, LabeledSentence.

    • You won't turn your corpus into a numpy ndarray – a totally unnecessary step which is also the proximate cause of the error you're seeing.

    • You won't be calling train() multiple times in your own loop, which is error-prone and almost always the wrong thing to do unless you're an expert user paying careful attention to all parameter management. (You're doing 10 loops, and in each loop doing 10 passes over the data. In each loop the class manages the learning-rate alpha itself, decaying it from 0.025 to 0.0001, so the rate jumps back up at the start of every loop and goes up and down during training, which is almost certainly not what you'd want.)

    • You won't be making every single document have the same, single tag 'TRAIN', which means Doc2Vec can't possibly do anything useful. The algorithm needs a variety of documents, with different tags, to learn contrasting vectors for different documents/tags.
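The learning-rate problem with calling train() in a loop can be seen with a little arithmetic. The sketch below does not use gensim's code. It is a simplified model that assumes alpha decays linearly from its start value to min_alpha over the passes of a single train() call, and the function names are hypothetical.

```python
def linear_alpha_schedule(start_alpha, end_alpha, epochs):
    """Alpha at the start of each epoch, decaying linearly toward end_alpha."""
    step = (start_alpha - end_alpha) / epochs
    return [start_alpha - step * i for i in range(epochs)]


def count_upward_jumps(schedule):
    """Count the places where the learning rate increases between epochs."""
    return sum(1 for a, b in zip(schedule, schedule[1:]) if b > a)


# Ten separate train() calls of 10 passes each: the decay restarts every call.
looped = []
for _ in range(10):
    looped.extend(linear_alpha_schedule(0.025, 0.0001, 10))

# One train() call covering the same 100 passes: a single smooth decay.
single = linear_alpha_schedule(0.025, 0.0001, 100)

print(count_upward_jumps(looped))  # 9
print(count_upward_jumps(single))  # 0
```

Under this model the looped schedule resets to 0.025 nine times, while the single call never increases the rate.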