I'd like to train doc2vec on items consisting of 2 pieces of information:

a) text (in the legal domain)
b) keywords and/or references to other legal texts, extracted from the text

I want my model to be able to identify similar texts according to, basically, 2 criteria:

a) textual similarity
b) presence of keywords/references

Are there any best practices for a case such as this? My ideas so far:

- join text and keywords/references into a single string and train a model on that
- train two independent models (two vectors will be produced: one for the text and one for the keywords)
I'm assuming by 'doc2vec' you mean the gensim implementation of the 'Paragraph Vector' algorithm, in the class Doc2Vec.
Both of your approaches might work and could be worth testing. There's no facility in the Doc2Vec class for feeding distinctly "other" data in, but you can make that data look like extra word-tokens, or extra tags, and thus have the cross-correlations of those other values affect, and be embedded within, the resulting vector-space.
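For example, here's a minimal sketch of those two representations, assuming gensim 4.x; the item layout, field names, and the REF_/KW_ prefixes are purely illustrative, not anything the question or gensim prescribes:

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical item layout; the field names are illustrative only.
item = {
    "id": "case_0001",
    "text": "the court held that the contract was void",
    "refs": ["REF_article_101", "KW_contract_law"],
}

tokens = item["text"].split()  # real preprocessing would be more careful

# Option A: references appended as extra word-tokens at the end of the text
doc_as_tokens = TaggedDocument(words=tokens + item["refs"], tags=[item["id"]])

# Option B: references supplied as extra tags, alongside the unique document ID
doc_as_tags = TaggedDocument(words=tokens, tags=[item["id"]] + item["refs"])
```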
Specifically, if you want your "keywords and/or references" to be modeled alongside the whole text, and not just against the normal words they might happen to be next to (as they would be if simply appended to the text), you should especially try one or both of the following (a code sketch follows the list):
- using the PV-DBOW mode (dm=0), which does not use word-to-nearby-word influences (within a context window)
- placing the keywords or references as extra tags, in addition to the unique-to-the-document ID tag (that's the classic way of naming doc-vectors)
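A rough sketch of both options combined (PV-DBOW plus keyword/reference tags), again assuming gensim 4.x and the doc_as_tags construction above; min_count=1 is only there so the one-document toy corpus actually trains, a real corpus would usually use a higher threshold:

```python
from gensim.models.doc2vec import Doc2Vec

# Toy corpus; in practice this would be thousands of TaggedDocuments.
corpus = [doc_as_tags]

model = Doc2Vec(
    corpus,
    dm=0,            # PV-DBOW: no context-window, word-to-nearby-word influence
    vector_size=200,
    min_count=1,     # kept low only so the one-document toy corpus trains
    epochs=40,
)

# Every tag -- the document ID and each keyword/reference -- gets its own vector
# in the same space, so similarity queries can mix documents and references.
similar_docs = model.dv.most_similar("case_0001")
near_a_reference = model.dv.most_similar("REF_article_101")
```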
(If trying two separate models, you might have the model based on natural texts still use a PV-DM mode affected by a window, while the essentially unordered nature of keywords/references would use a PV-DBOW mode.)
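If you go the two-model route, a sketch along those lines (again assuming gensim 4.x and the hypothetical item fields above) might look like the following; how you combine the two per-document vectors afterwards, whether by concatenation, a weighted sum, or keeping two separate similarity scores, is your own design choice:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Two toy corpora built from the same hypothetical item as above.
text_corpus = [TaggedDocument(words=tokens, tags=[item["id"]])]
ref_corpus = [TaggedDocument(words=item["refs"], tags=[item["id"]])]

# Natural text: PV-DM, where word order within a context window matters.
text_model = Doc2Vec(text_corpus, dm=1, window=5, vector_size=200, min_count=1, epochs=40)

# Keywords/references: PV-DBOW, since their order carries no meaning.
ref_model = Doc2Vec(ref_corpus, dm=0, vector_size=100, min_count=1, epochs=40)

# One simple combination: concatenate the two per-document vectors, then run
# your own similarity search (e.g. cosine similarity) over the combined vectors.
combined = np.concatenate([text_model.dv[item["id"]], ref_model.dv[item["id"]]])
```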