Tags: deep-learning, similarity, information-retrieval, doc2vec, document-classification

What is the best way to represent a collection of documents as a fixed-length vector?


I am trying to build a deep neural network that takes in a set of documents and predicts the category it belongs to.

Since the number of documents in each collection is not fixed, my first attempt was to map each document to a vector with doc2vec and use the average of those vectors.

The training accuracy is as high as 90%, but the testing accuracy is as low as 60%.

Is there a better way of representing a collection of documents as a fixed-length vector so that the words they have in common are captured?


Solution

The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.

Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".

If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
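
For concreteness, here is a minimal sketch of that averaging approach using gensim's Doc2Vec; the toy corpus, tokenization, and hyperparameters are placeholders, not a recommendation:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# Toy corpora: each "collection" is a list of tokenized documents.
collections = [
    [["sports", "match", "score"], ["team", "league", "win"]],
    [["stock", "market", "price"], ["trading", "shares", "profit"]],
]

# Train Doc2Vec on the individual documents, one tag per document.
all_docs = [doc for coll in collections for doc in coll]
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(all_docs)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Collapse a collection to one vector by averaging its per-doc vectors --
# note that this discards the per-document structure discussed above.
def collection_vector(coll):
    return np.mean([model.infer_vector(doc) for doc in coll], axis=0)

X = np.stack([collection_vector(coll) for coll in collections])
print(X.shape)  # (n_collections, 50)
```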

    To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.

I suspect from your description that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at a single nearest neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
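
The contrast is easy to see side by side: scikit-learn's NearestCentroid collapses each category to one mean vector, roughly the one-summary-vector scheme described above, while KNeighborsClassifier keeps every training example and votes over the k closest (data here is again a random placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Placeholder document vectors and labels, as in the sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 4, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One summary vector per category: assign the nearest class centroid.
centroid_clf = NearestCentroid().fit(X_train, y_train)

# True KNN: keep all known examples, vote over the 5 nearest neighbors.
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print(centroid_clf.score(X_test, y_test), knn_clf.score(X_test, y_test))
```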

If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help, as the "noise" in more numerous, varied examples tends to cancel itself out rather than acquire the sort of misguided importance that can be learned in smaller datasets.
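
In gensim, the main capacity knob is vector_size; this snippet shows a lower-capacity configuration (the value 20 and the toy corpus are illustrative assumptions, to be tuned against held-out data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; real tokenized documents would go here.
tagged = [TaggedDocument(words=["example", "tokens", str(i)], tags=[i])
          for i in range(100)]

# A smaller vector_size leaves less spare state for memorizing quirks
# of the training data.
small_model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=40)
print(small_model.infer_vector(["example", "tokens"]).shape)  # (20,)
```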

There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying them in lieu of Doc2Vec.