Which part of the algorithm specifically makes the embeddings to have the king - boy + girl = queen
ability? Did they just did this by accident?
Edit :
Take the CBOW as an example. I know about they use embeddings instead of one-hot vectors to encode the words and made the embeddings trainable instead of how we do when using one hot vectors that the data itself is not trainable. Then the output is a one-hot vector for target word. They just average all the surrounding word embeddings at some point then put some lego layers afterwards. So at the end they find the mentioned property by surprise, or is there a training procedure or network structure that gave the embeddings that property?
The algorithm simply works to train (optimize) a shallow neural-network model that's good at predicting words, from other nearby words.
That's the only internal training goal – subject to the neural network's constraints on how the words are represented (N floating-point dimensions), or combined with the model's internal weights to render an interpretable prediction (forward propagation rules).
There's no other 'coaching' about what words 'should' do in relation to each other. All words are still just opaque tokens to word2vec. It doesn't even consider their letters: the whole-token is just a lookup key for a whole-vector. (Though, the word2vec variant FastText varies that somewhat by also training vectors for subwords – & thus can vaguely simulate the same intuitions that people have for word-roots/suffixes/etc.)
The interesting 'neighborhoods' of nearby words, and relative orientations that align human-interpretable aspects to vague directions in the high-dimensional coordinate space, fall out of the prediction task. And those relative orientations are what gives rise to the surprising "analogical arithmetic" you're asking about.
Internally, there's a tiny internal training cycle applied over and over: "nudge this word-vector to be slightly better at predicting these neighboring words". Then, repeat with another word, and other neighbors. And again & again, millions of times, each time only looking at a tiny subset of the data.
But the updates that contradict each other cancel out, and those that represent reliable patterns in the source training texts reinforce each other.
From one perspective, it's essentially trying to "compress" some giant vocabulary – tens of thousands, to millions, of unique words – into a smaller N-dimensional representation - usually 100-400 dimensions when you have enough training data. The dimensional-values that become as-good-as-possible (but never necessary great) at predicting neighbors turn out to exhibit the other desirable positionings, too.