I am using the deeplearning4j Java library to build a paragraph vector model (doc2vec) of dimension 100. I am training it on a text file with around 17 million lines; the file is about 330 MB. I can train the model and compute paragraph vectors, and the results are reasonably good.
The problem is that when I save the model to disk with WordVectorSerializer.writeParagraphVectors (a dl4j method), it takes around 20 GB of space, and around 30 GB when I use the native Java serializer.
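For reference, here is roughly the pipeline I'm running, as a minimal sketch based on the standard dl4j paragraph-vectors example. The file names and the exact builder settings (minWordFrequency, epochs, etc.) are placeholders rather than my precise configuration:

```java
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;

public class ParagraphVectorsSketch {
    public static void main(String[] args) throws Exception {
        // One document per line of the ~330 MB corpus file (placeholder path)
        SentenceIterator iter = new BasicLineIterator(new File("corpus.txt"));

        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        // Auto-generated labels DOC_0, DOC_1, ... — one doc-vector per line
        LabelsSource labels = new LabelsSource("DOC_");

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .layerSize(100)          // 100-dimensional vectors, as described above
                .minWordFrequency(1)     // illustrative; raising this shrinks the vocabulary
                .epochs(1)
                .iterate(iter)
                .labelsSource(labels)
                .tokenizerFactory(tokenizer)
                .build();

        vectors.fit();

        // This is the save step that produces the very large file
        WordVectorSerializer.writeParagraphVectors(vectors, new File("paragraph-vectors.zip"));
    }
}
```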
I'm thinking maybe the model is too big for that amount of data. Is a model size of 20 GB reasonable for 330 MB of text data?
Comments are also welcome from people who have used doc2vec/paragraph vectors in other libraries or languages.
Thank you!
I'm not familiar with the dl4j implementation, but model size is dominated by the number of unique word-vectors/doc-vectors, and the chosen vector size.
(330 MB / 17 million) means each of your documents averages only about 20 bytes, which is very small for Doc2Vec!
But if, for example, you're training a 300-dimensional doc-vector for each doc, and each dimension is (as is typical) a 4-byte float, then 17 million docs * 300 dims * 4 bytes/dim = 20.4 GB. On top of that there's more space for word-vectors, the model's internal weights, the vocabulary, and so on, so the storage sizes you've reported aren't implausible.
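To make that concrete, here's a quick back-of-the-envelope sketch in plain Java, using the 17 million docs from your question, both your stated 100 dims and the 300-dim example above, and counting only the raw doc-vector floats (word-vectors, vocabulary and model weights come on top of this):

```java
public class DocVectorSizeEstimate {
    public static void main(String[] args) {
        long numDocs = 17_000_000L;   // one doc-vector per line of the corpus
        int bytesPerFloat = 4;        // typical 4-byte float per dimension

        for (int dims : new int[]{100, 300}) {
            double gigabytes = (double) numDocs * dims * bytesPerFloat / 1e9;
            System.out.printf("%d dims -> ~%.1f GB of raw doc-vectors%n", dims, gigabytes);
        }
        // Prints roughly: 100 dims -> ~6.8 GB, 300 dims -> ~20.4 GB
    }
}
```

So even at your stated 100 dimensions, the doc-vectors alone account for several gigabytes before anything else is serialized.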
With the sizes you've described, there's also a big risk of overfitting: at 300 dimensions, you'd be modeling docs of under 20 bytes of source material as (300 * 4 =) 1200-byte doc-vectors.
To some extent, that makes the model tend towards a giant lookup table of memorized inputs, and thus less likely to capture generalizable patterns that help it understand the training docs or new docs. Effective learning usually looks more like compression: modeling the source material as something smaller but more salient.