
Dataset for Doc2vec


Is there a free dataset already available for testing doc2vec? And if I wanted to create my own dataset, what would be an appropriate way to do it?


Solution

  • Assuming you mean the 'Paragraph Vectors' algorithm, which is often called Doc2Vec, any textual dataset is a potential test/demo dataset.

    The original papers by the creators of Doc2Vec showed results from applying it to:

    • movie reviews
    • search engine summary snippets
    • Wikipedia articles
    • scientific articles from Arxiv

    People have also used it on…

    • titles of articles/books
    • abstracts of larger articles
    • full news articles or scientific papers
    • tweets
    • blogposts or social media posts
    • resumes

    When learning, it's best to pick very simple, common datasets when you're first starting, and then move to larger datasets that you somewhat understand or that relate to your areas of interest, if you don't already have a sufficient project-related dataset.

    Note that the algorithm, like others in the [something]2vec family of algorithms, works best with lots of varied training data – many tens of thousands of unique words each with many contrasting usage examples, over many tens of thousands (or many more) of documents.
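To judge whether a candidate dataset is anywhere near that scale, a quick count of documents, unique words, and average uses per word can help. The following is only a minimal sketch: `corpus_stats` and `toy_docs` are hypothetical names introduced for illustration, and the corpus is assumed to be already tokenized into lists of words.

```python
from collections import Counter


def corpus_stats(docs):
    """Summarize a corpus given as a list of token lists."""
    # Count every token occurrence across all documents.
    word_counts = Counter(token for doc in docs for token in doc)
    total_tokens = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        "documents": len(docs),
        "unique_words": vocab_size,
        "total_tokens": total_tokens,
        # Average number of usage examples available per distinct word.
        "mean_uses_per_word": total_tokens / vocab_size if vocab_size else 0.0,
        "mean_doc_length": total_tokens / len(docs) if docs else 0.0,
    }


# A toy corpus, far too small for real training; it only shows the output shape.
toy_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are common pets".split(),
]
stats = corpus_stats(toy_docs)
print(stats)
```

If `documents` and `unique_words` come out in the hundreds rather than the tens of thousands, expect the weaker, less stable results described for small datasets.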

    If you crank the vector_size way down and the training epochs way up, you can eke some hints of its real performance out of smaller datasets of a few hundred contrasting documents. For example, the Python Gensim library's Doc2Vec intro tutorial and test cases use a tiny set of 300 news summaries from about 20 years ago, called the 'Lee Corpus', in which each text is only a few hundred words long.

    But there the vector_size is reduced to 50, much smaller than the hundreds of dimensions typical with larger training data, and perhaps still too many dimensions for such a small amount of data. The number of training epochs is also increased to 40, much larger than the default of 5 or the 10-20 epochs typical of Doc2Vec choices in published papers. Even with those changes, with so little data and textual variety, the effect of moving similar documents to similar vector coordinates will appear weaker to human review, and be less consistent between runs, than a better dataset will usually show (albeit at the cost of many more minutes or hours of training time).
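As a rough illustration of building your own small dataset and training with the reduced settings mentioned above, here is a sketch assuming Gensim 4.x is installed. The helper names `tokenize` and `train_small_doc2vec`, the regex tokenizer, and the use of list indexes as document tags are all assumptions made for this example, not part of Gensim or of the tutorial. Any parameter not listed falls back to Gensim's defaults and can be overridden.

```python
import re

# Settings quoted in the text above for the roughly 300-document demo corpus.
SMALL_CORPUS_PARAMS = {"vector_size": 50, "epochs": 40}


def tokenize(text):
    """Lowercase and keep runs of letters/digits (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_small_doc2vec(texts, **overrides):
    """Train Doc2Vec on raw strings, tagging each document with its list index."""
    # Imported here so that tokenize() stays usable without Gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Doc2Vec expects each document as a token list plus one or more tags.
    corpus = [
        TaggedDocument(words=tokenize(text), tags=[i])
        for i, text in enumerate(texts)
    ]
    params = {**SMALL_CORPUS_PARAMS, **overrides}
    # Passing the corpus to the constructor builds the vocabulary and trains.
    return Doc2Vec(corpus, **params)
```

With a list of a few hundred strings in a hypothetical variable `my_texts`, `model = train_small_doc2vec(my_texts)` followed by `model.dv.most_similar(0)` would list the documents whose vectors are closest to document 0. On data this small, expect those rankings to shift between runs.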