machine-learning, gensim, doc2vec

best training methods for binary text classification using doc2vec gensim


I am trying to use doc2vec for text classification based on document subject. For example, I want to classify all documents about sports as 1 and all other documents as 0. My plan is to first train a doc2vec model on training data, then use a classification model such as logistic regression to classify the texts as positive or negative.

I have seen various examples online of how to do this [1, 2], which employ different methods. I am unclear about some of the details of why certain methods are used, and about which method is best for text classification.

  1. Firstly, using the example above, is it better to train the model on just documents related to sports, or on documents covering all subjects? My thinking was that by training just on sports documents you could classify documents based on document similarity (although this wouldn't produce vectors for non-sports documents to use in training the next model). Also, I feel that if you train the model on all documents, you would need a huge number of documents to represent everything other than sports in order to get good classification.

  2. Secondly, which features are actually used to train the logistic regression model? If training the model on all documents, I assume you would track the documents using an index of some sort and then train the logistic regression model on the vectors paired with a class label. Is this correct?

  3. Thirdly, I have seen various uses of TaggedDocument where a unique ID is assigned to each document, and also where a shared ID is used to represent a class, e.g., 1 = sports, 0 = non-sports. From what I have read, a shared ID means the model has a single vector representing each class, while a unique ID provides a unique vector for each document. Is this correct? If so, and assuming I need uniquely labeled vectors for training the logistic regression model, what is the point of using a shared ID? Wouldn't that produce terrible classification results?

If anyone can help me with the questions above, and more generally with the best way to do text classification using doc2vec vectors, it would be greatly appreciated.


Solution

  • There's an example included in gensim of using Doc2Vec for sentiment classification, very close to your need. See:

    https://github.com/RaRe-Technologies/gensim/blob/bcee414663bdcbdf6a58684531ee69c6949550bf/docs/src/gallery/howtos/run_doc2vec_imdb.py

    (It's likely a better example than the other tutorials you link. In particular, the second tutorial you've linked currently mishandles the alpha learning rate in a misguided loop that calls train() multiple times.)
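
    As a minimal sketch of the recommended pattern (the corpus here is a hypothetical placeholder, and gensim 4.x parameter names like vector_size and epochs are assumed): build the vocabulary once, then make a single train() call and let gensim manage the alpha decay internally.

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Hypothetical training data: (tokenized text, class label) pairs,
    # where 1 = sports and 0 = everything else.
    raw_docs = [
        ("the team won the championship game".split(), 1),
        ("the recipe calls for two cups of flour".split(), 0),
        # ... many more documents of both kinds
    ]

    # The simple, standard setup: one unique ID tag per document.
    train_corpus = [
        TaggedDocument(words=words, tags=[f'doc_{i}'])
        for i, (words, label) in enumerate(raw_docs)
    ]

    model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
    model.build_vocab(train_corpus)

    # One train() call; gensim decays the learning rate (alpha) over
    # the requested epochs itself – no manual loop or alpha fiddling.
    model.train(train_corpus, total_examples=model.corpus_count,
                epochs=model.epochs)
    ```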

    Specifically with regard to your questions:

    (1) Train with as much data, both inside and outside the desired class, as possible. A model that's only seen "positive" examples is unlikely to generate meaningful vectors from documents totally unlike those it's been trained on. (In particular, with a model like Doc2Vec, it only knows words it's seen during training, and if you later try to infer vectors for new documents with unknown words, those words are ignored entirely.)
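
    For example, continuing the sketch above, inference on an unseen document simply drops any out-of-vocabulary words (the example text is hypothetical):

    ```python
    # Any words absent from the training vocabulary are silently
    # ignored when the vector is inferred.
    new_doc = "quarterback throws a late touchdown".split()
    new_vec = model.infer_vector(new_doc)
    ```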

    (2) Yes, a classifier (of any algorithm) is fed features and known labels. It then learns to deduce those labels from those features.
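
    Continuing the sketch with scikit-learn's LogisticRegression – note that `model.dv` is the gensim 4.x name for the learned document vectors (`model.docvecs` in older releases):

    ```python
    from sklearn.linear_model import LogisticRegression

    # Features: the trained per-document vectors, looked up by their
    # unique ID tags; labels: the known classes, in matching order.
    X = [model.dv[f'doc_{i}'] for i in range(len(raw_docs))]
    y = [label for words, label in raw_docs]

    clf = LogisticRegression()
    clf.fit(X, y)

    # To classify new text, infer its vector first, then predict:
    # 1 = sports, 0 = everything else.
    vec = model.infer_vector("the striker scored in extra time".split())
    print(clf.predict([vec]))
    ```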

    (3) Traditionally, Doc2Vec is trained with one unique ID 'tag' per document – and no known-label information. So, each document gets its own vector, and the process is totally "unsupervised". It's possible to instead give documents multiple tags, or use the same tag on more than one document. And, you could make those tags match known-labels – so all "sports" docs share 'sports' tag (either in addition to their unique-ID, or instead of it). But, doing this adds a number of other complications over the simple, one-ID-tag-per-document, case. So I wouldn't recommend trying anything in that direction until you've got the simpler case working. (I have seen a few cases where mixing in known-labels as extra tags can help a little, especially in multi-class classification issues, where such extra labels each only apply to a small subset of all documents. But it's not assured – and thus only makes sense to tinker with that after you have a working straightforward baseline, and repeatable way to evaluate alternate models against each other.)