Tags: python, gensim, word2vec, doc2vec

significance of periods in sentences while training documents with Doc2Vec


Doubt - 1

I am training Doc2Vec with 150,000 documents. Since these documents are from the legal domain, they are really hard to clean and get ready for further training, so I decided to remove all the periods from the documents. Having said that, I am confused about how the window parameter of Doc2Vec will recognize sentences now. Two views are presented in the question Doc2Vec: Differentiate Sentence and Document:

  1. The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be.
  2. It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.

Therefore I am unsure whether my approach of eliminating the punctuation (periods) is right. Kindly provide me with some guidance.
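
To make the two views concrete, here is a small sketch of my own (not from the linked question) showing the same text tokenized both ways:

import re

text = "The court dismissed the appeal. Costs were awarded."

# View 2: keep periods as standalone tokens
with_periods = re.findall(r"\w+|[.]", text.lower())
print(with_periods)     # ['the', 'court', 'dismissed', 'the', 'appeal', '.', 'costs', 'were', 'awarded', '.']

# My current approach: drop the periods entirely
without_periods = re.findall(r"\w+", text.lower())
print(without_periods)  # ['the', 'court', 'dismissed', 'the', 'appeal', 'costs', 'were', 'awarded']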

Doubt-2

The documents that I scraped range from 500 to 5500 tokens. To get fairly even-sized documents for training Doc2Vec, and also to reduce the vocabulary, my approach is: for a document longer than 1500 tokens, I use the first 50 to 400 tokens + tokens 600 to 1000 + the last 250 tokens. The motivation for this kind of approach comes from a paper on document classification with BERT, where sequences of 512 tokens were generated in this way.

So I want to know whether this idea is reasonable to proceed with, or whether it is not recommended.
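
For reference, here is roughly what my chunking looks like in code (a hypothetical sketch; the exact slice boundaries are just my illustration of "first 50 to 400 + 600 to 1000 + last 250"):

def chunk_long_document(tokens, limit=1500):
    """Hypothetical truncation of an over-long token list: keep tokens
    50-400, tokens 600-1000 and the last 250 tokens."""
    if len(tokens) <= limit:
        return tokens
    return tokens[49:400] + tokens[599:1000] + tokens[-250:]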

Update - I just looked at the common_texts corpus used by gensim in the tutorial https://radimrehurek.com/gensim/models/doc2vec.html and found that the documents in that corpus are simply lists of word tokens and do not contain any punctuation, e.g.:

from gensim.test.utils import common_texts

# common_texts is a tiny toy corpus: each document is just a list of tokens
print(common_texts[0:10])

Output:

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]

The same is followed in the tutorial https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html. So is my approach of removing periods from the documents valid? If so, how will the window parameter work, given that the documentation defines it as follows: window (int, optional) – The maximum distance between the current and predicted word within a sentence.
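
For context, this is the kind of usage I am following (a minimal sketch based on the gensim documentation, assuming gensim 4.x; the parameter values are just placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

# Each document is simply a list of tokens plus a tag - no sentence markers
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(common_texts)]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40)
vector = model.infer_vector(['human', 'interface'])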


Solution

  • Some people keep periods and other punctuation as standalone tokens, some eliminate them.

    There's no definitively 'right' approach, and depending on your end goals, one or the other might make a slight difference in the doc-vector quality. So for now just do what's easiest for you, and then later if you have time, you can evaluate the alternate approach to see if it helps.

    Despite any reference to 'sentences' in the docs, the Word2Vec/Doc2Vec/etc classes in gensim don't have any understanding of sentences, or special sensitivity to punctuation. They just see the lists-of-tokens you pass in as individual items in the corpus. So if you were to leave periods in, as in a short text like...

    ['the', 'cat', 'was', 'orange', '.', 'it', 'meowed', '.']
    

    ...then the '.' string is just another pseudo-word, which will get a vector, and the training windows will reach through it just like any other word. (And, 'meowed' will be 5 tokens away from 'cat', and thus have some influence if window=5.)
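
    As a quick illustrative sketch (assuming the gensim 4.x API; not a recommendation either way), you can confirm that '.' just becomes another vocabulary entry with its own vector:

    from gensim.models import Word2Vec

    texts = [
        ['the', 'cat', 'was', 'orange', '.', 'it', 'meowed', '.'],
        ['the', 'dog', 'was', 'black', '.', 'it', 'barked', '.'],
    ]

    # min_count=1 only so this toy corpus isn't filtered away
    model = Word2Vec(texts, vector_size=20, window=5, min_count=1, epochs=10)

    print('.' in model.wv.key_to_index)   # True - the period is an ordinary token
    print(model.wv['.'][:5])              # ...and it has a trained vector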

    I don't quite understand what you mean about "make use of First 50 to 400 tokens + 600 to 1000 tokens + last 250 tokens". Doc2Vec works fine up to texts of 10000 tokens. (More tokens than that will be silently ignored, due to an internal implementation limit of gensim.) It's not necessary or typical to break docs into smaller chunks, unless you have some other need to model smaller chunks of text.

    The tiny common_texts set of word-lists is a contrived, toy-sized bit of data to demonstrate some basic code usage - it's not an example of recommended practices. The demos based on the 'Lee' corpus are similarly a quick intro to a tiny and simple approach that's just barely sufficient to show basic usage and results. Its text tokenization – via the simple_preprocess() utility function – is an OK thing to try, but not 'right' or 'best' compared to all the other possibilities.
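
    For reference, simple_preprocess() is one easy way to get punctuation-free token lists like those used above; this sketch just shows its default behavior (lowercasing, dropping punctuation and very short/long tokens):

    from gensim.utils import simple_preprocess

    print(simple_preprocess("The cat was orange. It meowed."))
    # ['the', 'cat', 'was', 'orange', 'it', 'meowed']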