
Word2Vec - same context word different label


The link to the tutorial is https://www.tensorflow.org/tutorials/text/word2vec. The example sentence there is "The wide road shimmered in the hot sun". I have two questions.

  1. It uses negative sampling, but here the same context word "shimmered" gets two different labels, and the target word "road" itself also appears among the negative samples.
  2. I heard that skip-gram weights nearby words more heavily, that is, a context word close to the target word gets a higher weight, but that doesn't show up in this tutorial. I wonder which one is correct for skip-gram. Thanks.

I have tried asking ChatGPT and looking at other skip-gram tutorials.


Solution

  • Regarding your (1) question:

    Because the negative-sampling candidates are chosen randomly from the whole vocabulary, there is always a chance that the center target word (here 'road') or the positive-example nearby context word (here 'shimmered') will also be chosen as a negative example.

    In a tiny toy-sized example like this, with a total vocabulary of just 8 words, that chance is pretty high!

    By contrast, in typical vocabularies of tens- to hundreds-of-thousands of words – which are required to have a model that really works rather than just illustrates the process – such random picks of the target or positive word are very rare.

    But in practice, this has negligible effect on the final results. The positive word 'shimmered' usually won't be among the negative examples. And even when it is, its appearance as the positive example will, on net, still nudge its vector differently than the other negative examples.

    And in a realistically-sized training corpus, with many examples of each word's usage in context, and with multiple training epochs, it's even less of a problem. Even if the intended positive example appears, by bad luck, many times among the randomly-drawn negative examples, overall the necessary meaning-influenced distinguishing updates to the model still happen. So typical implementations don't even worry about the rare cases where the same positive word is also drawn as a negative example: trying to check for, and specially handle, that rare case would cost more than any slight theoretical benefit it could offer.

    (Also note: 'road' is a perfectly legitimate negative example for a center target word 'road' – most words don't appear near themselves, and for the few that do, the ('road', 'road') skip-gram will appear many times, tugging the 'road' vector appropriately.)
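    As a rough illustration, and not the tutorial's actual code, here is a minimal sketch of unfiltered negative sampling, using a toy vocabulary and the ('road', 'shimmered') pair assumed from the tutorial's example sentence:

    ```python
    import random

    # Toy vocabulary taken from the tutorial's example sentence (plus padding).
    vocab = ['<pad>', 'the', 'wide', 'road', 'shimmered', 'in', 'hot', 'sun']

    def sample_negatives(num_ns, rng=random):
        """Draw num_ns 'negative' words at random from the whole vocabulary.

        Real implementations draw from a frequency-skewed distribution, but
        they typically do NOT filter out the target or the positive word, so
        collisions like those in the tutorial's output can and do happen.
        """
        return [rng.choice(vocab) for _ in range(num_ns)]

    target, positive = 'road', 'shimmered'
    negatives = sample_negatives(num_ns=4)
    print(f"target={target!r} positive={positive!r} negatives={negatives}")
    # With only 8 words, 'road' or 'shimmered' will often show up among the
    # negatives; with a realistic 100k-word vocabulary, that would be rare.
    ```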

  • Regarding your (2) question:

    Typical full-featured word2vec implementations do essentially weight nearer words more heavily, but in a somewhat non-intuitive way, for efficiency. No matter what window size you've specified, the actual window that it uses, for a given center target word, will be some random value from 1 to the window size you've chosen. That is, for a window=3, it will actually sometimes use an effective value of window=1, and sometimes window=2, and sometimes window=3.

    This ensures that the immediate nearest neighbors are always used for skip-gram pairs, while words a full window-size away are only used about 1/window of the time. Over many texts & epochs, this has roughly the same effect as if each position were assigned a distance-based scaling factor – but by performing less calculation, rather than more (every position every time, with a multiplicative scaling factor).

    The original word2vec.c reference code from the Google researchers who published the word2vec algorithm did this – calling the effective, reduced window value b – and the Python Gensim library's re-implementation based on that code keeps these effective-window values in an array called reduced_windows.
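    Here is a minimal sketch of the idea (not the word2vec.c or Gensim source), drawing a per-target effective window and using it to emit skip-gram pairs:

    ```python
    import random

    def skipgram_pairs(tokens, window=3, rng=random):
        """Yield (target, context) pairs using a per-target 'reduced' window.

        For each center position an effective window is drawn uniformly from
        1..window, so immediate neighbors are always included while words a
        full `window` away are included only ~1/window of the time.
        """
        for i, target in enumerate(tokens):
            effective = rng.randint(1, window)   # analogous to word2vec.c's window - b
            lo, hi = max(0, i - effective), min(len(tokens), i + effective + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, tokens[j]

    sentence = "the wide road shimmered in the hot sun".split()
    for pair in skipgram_pairs(sentence, window=3):
        print(pair)
    ```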

    I'd guess that tensorflow word2vec implementations that are more complete also do this. As you've noted, this small illustrative tutorial – which notes at its top that it is not an exact implementation of the usual approach – doesn't seem to do any such weighting (via dynamic-window-shrinking or otherwise).

    Where such window-shrinking is used, it's used for both the skip-gram and continuous-bag-of-words modes.

    Update: regarding your follow-up questions in the comments about ordering issues:

    Note that data that's not real natural-language text – like, say, lists of co-ordered products – may not have the same token-frequency distributions & co-occurrence patterns as natural language. The initial write-ups & evaluations were mainly on real natural-language text.

    So while word2vec has since been shown to often work well on such other token-streams, remember that it's a bit of a different domain than examples using meaningful natural-language sentences/paragraphs/etc – so it may benefit from trying quite different parameters.

    In particular, early word2vec implementations hard-coded the negative-sampling exponent parameter at a value of 0.75, which worked well in the early natural-language experiments. (In the article you link, this parameter is named ɑ, but in Gensim it's called ns_exponent.)

    There's been work (footnote 14 in the article you link) suggesting that in other domains, like recommendation, values very different from 0.75 may be better – even negative values. Both -0.5 & 1.0 worked better on some of that paper's evaluations than the old 0.75 hardcoded default.

    I say all this to highlight: if working with natural language, you may not want to tinker with ns_exponent much, if at all; but when using word2vec further afield, trying a wider range of values makes sense.
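    For instance, in Gensim (assuming a 4.x version; the token-streams below are made-up stand-ins), ns_exponent is just a constructor argument, so sweeping a few values might look roughly like this:

    ```python
    from gensim.models import Word2Vec

    # Hypothetical token-streams (e.g. per-user product sequences), not real data.
    corpus = [['prodA', 'prodB', 'prodC'], ['prodB', 'prodD']]

    # 0.75 is the classic natural-language default; values like -0.5 or 1.0 have
    # been reported to work better in some recommendation-style domains.
    for exponent in (-0.5, 0.75, 1.0):
        model = Word2Vec(sentences=corpus, vector_size=32, window=5,
                         min_count=1, sg=1, ns_exponent=exponent, epochs=5)
        # ...evaluate each trained model on your own downstream task here...
    ```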

    As you note, for tokens where ordering isn't meaningful – perhaps because the logs/DB returned them in some random, or sorted, order that's arbitrary with respect to the original generating user's actual actions – you also might make different choices for window & other parameters.

    For example, Gensim's recent versions offer a shrink_windows parameter that, with its default True value, does the usual dynamic shrinking of effective windows to essentially weight nearer words higher. If in your data, you're sure the nearness of tokens is truly irrelevant, you can set shrink_windows=False, and the full window size will be used every time.

    And if you set window to some number larger than twice your largest 'text' (in token count), you'd ensure that for each text, every word is always in every other word's context window. (If your texts are long, that could be expensive, but it may also more truly match the meaning of your token-sets.)
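    Putting those two choices together in Gensim (a sketch, assuming a version recent enough to offer shrink_windows, with made-up order-free token-sets) could look like:

    ```python
    from gensim.models import Word2Vec

    # Hypothetical 'texts' where token order carries no meaning.
    texts = [['prodA', 'prodB', 'prodC'], ['prodD', 'prodB', 'prodE', 'prodF']]

    # A window larger than the longest text (doubling is a safe overestimate),
    # combined with shrink_windows=False, means every token in a text is always
    # in every other token's context window, with equal weight.
    big_window = 2 * max(len(t) for t in texts)

    model = Word2Vec(sentences=texts, vector_size=32, min_count=1, sg=1,
                     window=big_window, shrink_windows=False, epochs=5)
    ```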

    Now, with regard to the choice of CBOW or skip-gram, both are influenced by word nearness – whether words appear within the window or not, and (with the default shrink_windows=True) also whether they're near neighbors or farther neighbors. But neither is sensitive to exact orderings in a manner that really learns anything from phrases or grammar. While one or the other might work better, or faster, under certain conditions, neither is inherently more ordering-oblivious than the other.

    If you have time & a way to score them against each other, both could be worth trying and evaluating against each other, but neither is necessarily better on first principles. The article likely just picked skip-gram because it usually does OK, is probably a little simpler to explain, & it wasn't interested in discussing/checking both modes.
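    If you do want to compare the two modes, the only switch in Gensim is the sg flag; the evaluate() function below is a purely hypothetical placeholder for whatever scoring you have available:

    ```python
    from gensim.models import Word2Vec

    corpus = [['prodA', 'prodB', 'prodC'], ['prodB', 'prodD']]   # stand-in data

    def evaluate(model):
        """Placeholder: score the model on your own downstream task."""
        return 0.0

    for sg in (0, 1):   # 0 = CBOW, 1 = skip-gram
        model = Word2Vec(sentences=corpus, vector_size=32, window=5,
                         min_count=1, sg=sg, epochs=5)
        print('skip-gram' if sg else 'CBOW', evaluate(model))
    ```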

    Finally, if your token-sets don't have a meaningful ordering, but might have been given some forced order – say by an alphabetical sort – as part of prepping the corpus, and if you either leave nearer-weighting on or use a window value smaller than any of your token-sets, then your model may learn spurious associations. For example, two alphabetically-near product names would appear more often in each other's context windows, or have a disproportionate influence on each other, because of their artificial, process-created nearness. The two different ways to fight this are to either:

    1. ensure the windows are large enough, and shrink_windows is off, so that to the algorithm all tokens in a single text are equally near; or:
    2. randomly shuffle the tokens inside each text at least once before training, so that even windows that don't cover the whole text aren't oversampling false associations (see the sketch after this list).
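    That per-text shuffle (option 2) is only a few lines; this sketch assumes the corpus is just a list of token-lists:

    ```python
    import random

    # Hypothetical texts that arrived in an arbitrary (here alphabetized) order.
    corpus = [['prodA', 'prodB', 'prodC'], ['prodD', 'prodE', 'prodF']]

    # Shuffle each text in place so any order imposed during corpus prep
    # (alphabetical sorting, DB return order, etc.) can't create spurious
    # nearness-based associations within small windows.
    for tokens in corpus:
        random.shuffle(tokens)
    ```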

    (This was a lengthier explanation of the same consideration that your linked article mentions as "an additional theoretical note" in the "Window size" section.)