Gensim Doc2vec model clustering into K-means


I'm new to doc2vec and I hope one of you can help me with this issue. I've asked a number of people about it, but nobody knows the solution.

What I want to do is cluster the Doc2vec result with k-means. Please see the code below.

mbk = MiniBatchKMeans(n_clusters=3, init_size=400, batch_size=300, verbose=1).fit(model_dm.docvecs[:2000])
MiniBatchKMeans.predict(mbk,mbk.labels_ )

I'm getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-fbc57a13bf4b> in <module>()
      6 
      7 
----> 8 mbk = MiniBatchKMeans(n_clusters=3, init_size=400, batch_size=300, verbose=1).fit(model_dm.docvecs[:2000])
      9 
     10 #model_dm.docvecs.doctag_syn0[2000]

/usr/local/lib64/python2.7/site-packages/gensim/models/doc2vec.pyc in __getitem__(self, index)
    351             return self.doctag_syn0[self._int_index(index)]
    352 
--> 353         return vstack([self[i] for i in index])
    354 
    355     def __len__(self):

TypeError: 'slice' object is not iterable

Solution

  • You are trying to cluster a single document vector (the 2001st vector, to be precise) in this part of your code:

    .fit(model_dm.docvecs[2000]) 
    

    I assume you want the first 2000 vectors?

    Edit:

    After looking at the gensim documentation, I couldn't see a way to get a slice of document vectors. Looking at the source code, though, DocvecsArray accepts either a single key (int or str) or a list of keys. With that, you can get the first 2000 vectors using:

    .fit(model_dm.docvecs[range(2000)])
    

    It doesn't look satisfying, so I will update my answer if I find a better way.

    Also bear in mind that these may not be the first 2000 vectors, since gensim seems to store docvecs as key:value pairs and dictionaries are not ordered.

    Second Edit:

    The k-means part of the code also needs to be fixed. You are calling predict on the MiniBatchKMeans class itself and passing the fitted instance (mbk) as an argument. If you need to predict clusters for any other vectors, call predict on the instance instead (mbk.predict(...)). I assume you don't need that here, because the labels for the training vectors are already stored on the fitted instance.

    You can get the assigned labels using the code below.

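    from sklearn.cluster import MiniBatchKMeans

    # Fit on the vectors for keys 0..1999, then read the cluster label of each one
    mbk = MiniBatchKMeans(n_clusters=3, init_size=400, batch_size=300, verbose=1).fit(model_dm.docvecs[range(2000)])
    mbk.labels_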
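
To illustrate why the slice fails while range(2000) works, here is a minimal sketch of the indexing logic visible in the traceback. ToyDocvecs is a hypothetical stand-in written for this example, not gensim's actual class: it uses plain Python lists instead of a NumPy array and vstack, and it only handles int keys as the single-key case.

```python
class ToyDocvecs:
    """Hypothetical stand-in that mimics the __getitem__ logic from the traceback."""

    def __init__(self, vectors):
        self.vectors = vectors  # one vector (a list of floats) per document

    def __getitem__(self, index):
        # A single int key returns exactly one vector.
        if isinstance(index, int):
            return self.vectors[index]
        # Anything else is assumed to be an iterable of keys.
        # A slice object is not iterable, so this line raises TypeError.
        return [self[i] for i in index]


# Five toy 2-dimensional "document vectors".
docvecs = ToyDocvecs([[float(i), float(i) * 2] for i in range(5)])

single = docvecs[2]          # one vector
several = docvecs[range(3)]  # a list with the vectors for keys 0, 1 and 2

try:
    docvecs[:3]              # slice: falls through to the iteration branch
    slice_error = None
except TypeError as exc:
    slice_error = str(exc)

print(single)
print(len(several))
print(slice_error)
```

With the same reasoning, docvecs[2000] in the real model returns a single vector, while docvecs[range(2000)] returns one row per key, which is the 2-D input that fit expects.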