The following question refers to the implementation of the Word2Vec and Doc2Vec algorithms provided by the great gensim package.
I know similar questions have been asked; however, the given answers do not seem to be the best solution for my use case.
I have a large corpus of 110,000 financial reports with an average length of approx. 30,000 tokens. My goal is to train word vectors first. In a next step, I want to infer doc vectors on sentence level and examine if the vector is similar to the average vector of topic words, e.g., sustainability, environmental, emissions.
My first idea was to use the possibility to train word vectors and doc vectors at the same time. However, if I split the reports into sentences, this results in many millions of sentences (documents), which exceeds my memory (32GB) for storing the arrays of words and documents.
The next idea is to treat every report as a single document for training. I read on GitHub that documents are only trained up to a limit of 10,000 tokens, but that I can split a document into parts of 10,000 tokens and use the same tag. So far, this results in a trained model which gives me meaningful word vectors (in the sense that they learned word similarity) and lets me use the infer_vector method later to infer doc vectors for individual sentences. However, it does not feel like a very good solution, because I first train a large number of document vectors which are not used for anything.
My desired goal would be to train a Word2Vec model first, then use the word vectors for an "empty" Doc2Vec model which gives me access to the infer_vector method when needed. My understanding is that this is not easily possible because no pre-trained word vectors can be initialized for a Doc2Vec model, right? I know this is not necessary under common use cases related to Doc2Vec, but I hope this question clarifies why it would make sense in my case.
I would also appreciate guidance on how I could use the internal C functions which are used by the infer_vector method for training a single docvec given word vectors. Unfortunately, I have no C experience at all.
Any help or advice would be highly appreciated, and to be honest I hope Gordon Mohr or someone else from the gensim team might read this;)
Best regards Ralf
The `Doc2Vec` algorithm, called "Paragraph Vector" in the papers that introduced it, is not initialized from external pretrained word-vectors, nor is creating word-vectors a distinct 1st step of creating a `Doc2Vec` model from scratch that could somehow be done separately, or cached/reused across runs. So not even the internal inference routines can do anything with just some external word-vectors - they depend on model weights separate from word-vectors, learned from doc-to-word relations seen in training.
(I've occasionally seen some variants/improvised-changes that move a bit in the direction of taking outside word-vectors, but I've not seen evidence such variations outperform the usual approach, and they're not implemented in Gensim.)
In standard `Doc2Vec`, rather than taking word-vectors as an input, if the chosen mode of `Doc2Vec` creates typical per-word word-vectors at all, they get co-trained simultaneously with the doc-vectors.
In the plain "PV-DBOW" mode – `dm=0` – no typical word-vectors are trained at all, only doc-vectors & the support for inferencing new doc-vectors. This mode is thus pretty fast and often works quite well for broad topical similarity for short docs of dozens to hundreds of words – because the only thing training is trying to do is predict in-doc words from candidate doc-vectors. In this mode, the `window` parameter is meaningless - every word in a doc affects its doc-vector.
You can optionally add to that PV-DBOW mode interleaved skip-gram word-vector training, by using the non-default `dbow_words=1` parameter. This co-training, using a shared output (center word) prediction layer, forces the word-vectors & doc-vectors into a shared coordinate system – so that they're directly comparable to each other. The `window` parameter then affects the skip-gram word-to-word training, just like in word2vec skip-gram training. Training takes longer, by a factor of about the `window` value – and in fact the model is spending more total computation making the words predict their neighbors than the doc-vector predicting the doc words. So there's a margin at which improving the word-vectors may be 'crowding out' improvement of the doc-vectors.
The PV-DM mode – the default `dm=1` parameter – inherently uses a combo of candidate doc-vector & neighbor words to predict each center word. That makes `window` relevant, and inherently puts the word-vectors & doc-vectors into a shared comparable coordinate space, without as much overhead for larger `window` values as the interleaved skip-gram above. There may still be some reduction in doc-vector expressiveness to accommodate all the word-to-word influences.
Which is best for a particular set of docs, subject domain, and intended downstream use is really a matter for experimentation. As you've mentioned that comparing doc-vectors to word-vectors is an aim, only the latter two modes above – PV-DBOW with optional skip-gram, or PV-DM – would be appropriate. (But if you don't absolutely need that, & have time to run more comparisons, I'd still recommend trying plain PV-DBOW for its speed & strength in some needs.)
Let's assume your sentences are an average of 20 tokens each, so your 110k docs * 30k tokens / (20 tokens/sentence) give you 165 million sentences. Yes, holding (say) 300-dimensional doc-vectors (1200 bytes each) in-training for 165 million texts has prohibitive RAM costs: 198 GB.
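The arithmetic behind that estimate, spelled out. The 20 tokens per sentence and 300 dimensions are the assumptions stated above, 4 bytes per dimension assumes 32-bit floats, and the figure covers only the raw doc-vector array, not any other model state.

```python
n_docs = 110_000
tokens_per_doc = 30_000
tokens_per_sentence = 20   # assumed average sentence length
vector_size = 300          # assumed dimensionality
bytes_per_float = 4        # 32-bit floats

n_sentences = n_docs * tokens_per_doc // tokens_per_sentence
bytes_per_vector = vector_size * bytes_per_float

# Raw doc-vector storage if every sentence gets its own vector.
sentence_level_gb = n_sentences * bytes_per_vector / 1e9
# Raw doc-vector storage if only the full reports get vectors.
report_level_gb = n_docs * bytes_per_vector / 1e9

print(f"{n_sentences:,} sentences -> {sentence_level_gb:.0f} GB")
print(f"{n_docs:,} reports -> {report_level_gb:.3f} GB")
```

By the same arithmetic, the doc-vectors for the 110k full reports take only about 0.13 GB.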
As you've noted, you could use a model trained for only the 110k docs to then infer doc-vectors for other smaller texts, like the sentences. You shouldn't worry about those 'wasted' 110k doc-vectors: they were necessary to create the inferencing capability, you could throw them away after training (& inference will still work), and maybe you will have some reason to compare words or sentences or new docs or other doc-fragments to those full-doc vectors.
You could also consider training on chunks larger than sentences but smaller than your full docs, like paragraphs or sections, if you can segment docs that way. You could conceivably even use arbitrary N-token chunks, and it might work well - only way to know is to try. This sort of algorithm isn't super-sensitive to small changes in tokenization/text-segmenting, as it's the bulk of the data, and broad relationships, it's modeling.
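For the arbitrary N-token variant, the splitting could look something like the sketch below. The helper names are hypothetical, and the 10,000 default simply mirrors the per-text token limit mentioned in the question. Each `(words, tags)` pair could then be wrapped in Gensim's `TaggedDocument(words, tags)`.

```python
def chunk_tokens(tokens, chunk_size=10_000):
    """Yield consecutive slices of at most `chunk_size` tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

def tagged_chunks(doc_id, tokens, chunk_size=10_000):
    """Return (words, tags) pairs; every chunk reuses the tag of its source document."""
    return [(chunk, [doc_id]) for chunk in chunk_tokens(tokens, chunk_size)]

# Tiny demonstration with a 25-token "report" and 10-token chunks.
demo = tagged_chunks("report_1", [f"tok{i}" for i in range(25)], chunk_size=10)
for words, tags in demo:
    print(tags, len(words))
```

Reusing the document's tag for every chunk keeps one doc-vector per report; giving each chunk its own tag instead would train one vector per chunk.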
You can also simultaneously train doc-vectors for different levels of text, by supplying more than one 'tag' (key for looking up the doc-vector post-training) per example text. That is, if your full document with ID `d1` has 3 distinct sections `d1s1`, `d1s2`, `d1s3`, you could feed it the doc as 3 texts: the 1st section with tags `['d1', 'd1s1']`, the 2nd with tags `['d1', 'd1s2']`, the 3rd with tags `['d1', 'd1s3']`. Then all the texts contribute to the train-tuning of the `d1` doc-vector, but the subsections only affect the respective subsection-vectors. Whether that'd be appropriate depends on your goals – & the effect of supplying multiple tags varies a bit between modes – but it may also be worth some experiments.
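To make the tagging scheme concrete, here is a plain-Python sketch that builds the three tagged texts and then lists which texts would feed each tag's vector. The section contents are made-up placeholders; in Gensim each pair would become a `TaggedDocument(words, tags)`.

```python
from collections import defaultdict

doc_id = "d1"
# Hypothetical tokenized sections of document d1.
sections = {
    "d1s1": ["overview", "of", "the", "business"],
    "d1s2": ["environmental", "and", "emissions", "targets"],
    "d1s3": ["financial", "statements", "and", "notes"],
}

# One training text per section, tagged with both the document ID and the section ID.
examples = [(tokens, [doc_id, section_id]) for section_id, tokens in sections.items()]

# Which texts (by position) would contribute to each tag's doc-vector.
texts_per_tag = defaultdict(list)
for position, (tokens, tags) in enumerate(examples):
    for tag in tags:
        texts_per_tag[tag].append(position)

for tag, positions in texts_per_tag.items():
    print(tag, positions)
```

The `d1` tag is fed by all three texts, while each section tag is fed only by its own text.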