When constructing training data for CBOW, Mikolov et al. suggest using the word at the center of a context window as the target. What is the "best" approach to capturing words at the beginning/end of a sentence? (I put "best" in quotes because I'm sure this depends on the task.) Implementations I see online do something like this:
data = []
for i in range(2, len(raw_text) - 2):
    # use the 2 words on either side of position i as the context
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    data.append((context, raw_text[i]))
I see two issues arising from this approach: the first and last 2 words of the text never get to be the target word, and words near the ends appear in fewer context windows than interior words.
Can anyone offer insight as to how much these issues affect the results, or any alternative approaches for constructing the training data? (I considered letting the first word be the target word and using the next N words as the context, but this creates issues of its own.)
Related question on Stack Exchange: Construct word2vec (CBOW) training data from beginning of sentence
All actual implementations I've seen, going back to the original word2vec.c
by Mikolov, tend to let every word take turns being the 'center target word', but truncate the context-window to whatever is available.
So for example, with a window=5 (on both sides) and the 'center word' as the 1st word of a text, only the 5 following words are used as context. If the center word is the 2nd word, 1 preceding word and 5 following words will be used.
This is easy to implement and works fine in practice.
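For concreteness, here's a minimal sketch of that clipped-window pairing (my own illustration, with a hypothetical cbow_pairs helper - not the actual word2vec.c logic):

def cbow_pairs(tokens, window=5):
    # Every position takes a turn as the center/target word; the
    # context window is simply clipped at the edges of the token list.
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # clipped at the start
        right = tokens[i + 1:i + 1 + window]      # slicing clips past the end
        pairs.append((left + right, target))
    return pairs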
In CBOW mode, every center word is still part of the same number of neural-network forward-propagations (roughly, prediction attempts), though words 'near the ends' participate as inputs slightly less often. But even then, when they do appear they're subject to an incrementally larger update - such as when they're 1 of just 5 context words, instead of 1 of 10.
(In SG mode, words near the ends will be both inputs and target words slightly less often.)
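You can see the slight end-effect by counting how often each position serves as a context (input) word under clipped windows - again, just an illustrative sketch:

from collections import Counter

def context_counts(tokens, window=2):
    # How often does each position serve as a context (input) word?
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + 1 + window)):
            if j != i:
                counts[j] += 1
    return counts

print(context_counts(list("abcdef"), window=2))
# Counter({2: 4, 3: 4, 1: 3, 4: 3, 0: 2, 5: 2}) - interior positions serve
# as inputs 4 times each, the end positions only 2 times.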
Your example code – showing words without full context windows never being the center target – is not something I've seen, and I'd only expect that choice in a buggy/unsophisticated implementation.
So neither of your issues arise in common implementations, for any text more than 1 word long. (Even in a text of just 2 words, the 1st word will be predicted using a window of just the 2nd, and the 2nd will be predicted using a window of just the 1st.)
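With the cbow_pairs sketch above, for instance:

cbow_pairs(["hello", "world"], window=5)
# -> [(['world'], 'hello'), (['hello'], 'world')]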
While the actual word-sampling does result in slightly different treatment of words at either end, it's hard for me to imagine these slight differences making any difference in results, given an appropriate training corpus for word2vec - large & varied, with plentiful contrasting examples for all relevant words.
(Maybe it'd be an issue in some small or synthetic corpus, where some rare-but-important tokens only appear in leading- or ending-positions. But that's far from the usual use of word2vec.)
Note also that while some descriptions & APIs describe the units of word2vec training as 'sentences', the algorithm really just works on 'lists of tokens'. Often each list-of-tokens will span paragraphs or documents, and sometimes such lists retain things like punctuation, including sentence-ending periods, as pseudo-words. Bleeding the windows across sentence boundaries rarely hurts, and often helps, as the cooccurrences of words leading out of one sentence and into the next may be just as instructive as the cooccurrences of words inside one sentence. So in the common practice of many-sentence training texts, even fewer 'near-the-ends' words get even the slightly-different sampling treatment than you may have thought.
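For example, with gensim's Word2Vec (assuming gensim 4.x; the toy corpus & parameters here are just for illustration), each training unit is simply a list of tokens, which may span several actual sentences:

from gensim.models import Word2Vec

# One training 'sentence' that actually spans two sentences,
# keeping the periods as pseudo-words.
texts = [
    ["the", "cat", "sat", ".", "then", "the", "cat", "slept", "."],
]
model = Word2Vec(sentences=texts, vector_size=50, window=5,
                 min_count=1, sg=0)  # sg=0 selects CBOW mode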