I am training my own embedding vectors, as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the distribution of weights within a vector ought to look like, averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains the model with too few epochs, the vectors don't change significantly from their initial values (easy to see if you initialize your vectors with weight 0 in every dimension). Thus, if my weight distribution is tightly centered around some point (typically 0), I've under-trained on my corpus.
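Concretely, the kind of check I mean looks roughly like this (a rough sketch, assuming the trained vectors are available as a numpy matrix; the file path is just a placeholder):

```python
import numpy as np

# Rough check for under-training: if the weights barely moved from their
# initial values, the pooled distribution stays tightly centered.
# "wos_vectors.npy" is a placeholder for a (vocab_size x 150) matrix of
# trained vectors, e.g. saved from model.wv.vectors if using gensim.
vectors = np.load("wos_vectors.npy")

weights = vectors.ravel()
print("mean:", weights.mean())
print("std:", weights.std())
print("share of weights within 0.01 of zero:", np.mean(np.abs(weights) < 0.01))
```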
If one trains the model on too few documents, or over-trains, the vectors show significant correlation with each other (when I visualize a random set of vectors, stripes appear where nearly all the vectors have weights of the same sign, either positive or negative).
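The "stripes" show up when I stack a random sample of vectors into a matrix and image it; roughly (again assuming a numpy matrix of trained vectors, with matplotlib just for display):

```python
import numpy as np
import matplotlib.pyplot as plt

vectors = np.load("wos_vectors.npy")   # placeholder: (vocab_size x 150) matrix

# Stack a random sample of vectors and image it: vertical stripes are
# dimensions where nearly every vector shares the same sign, i.e. the
# vectors are strongly correlated with one another.
sample = vectors[np.random.choice(len(vectors), size=200, replace=False)]
plt.imshow(sample, aspect="auto", cmap="coolwarm")
plt.xlabel("dimension")
plt.ylabel("sampled word vector")
plt.colorbar(label="weight")
plt.show()

# A cruder numeric version of the same check: per-dimension sign agreement
# (1.0 means every sampled vector has the same sign on that dimension).
sign_agreement = np.abs(np.sign(sample).mean(axis=0))
print("dimensions with near-total sign agreement:", np.sum(sign_agreement > 0.9))
```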
What I imagine a single "good" vector looks like is one with weights spread across the entire range of -1 to 1. Any single vector may have significantly more dimensions near -1 or 1; however, averaged over an entire corpus, vectors that randomly lean toward one end of the spectrum or the other should balance out, so that the pooled weight distribution over the whole corpus is approximately uniform across that range. Is this intuition correct?
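In other words, I'd expect a pooled histogram over every dimension of every vector to look roughly flat, or at least symmetric, rather than piled up around one value; something like:

```python
import numpy as np
import matplotlib.pyplot as plt

vectors = np.load("wos_vectors.npy")   # placeholder: (vocab_size x 150) matrix

# Pooled histogram over every dimension of every vector.
plt.hist(vectors.ravel(), bins=100)
plt.xlabel("weight value")
plt.ylabel("count (all vectors, all dimensions)")
plt.show()
```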
I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
Some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, while those with many meanings have smaller magnitudes. A plausible explanation is that word-vectors for polysemous word-tokens are being pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine-similarity to compare only angles (or, largely equivalently, by normalizing all vectors to unit length before comparisons).
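If you want to eyeball this effect in your own model, comparing raw vector norms is enough; a quick sketch, assuming a gensim Word2Vec model and using arbitrary example tokens:

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("wos_w2v.model")   # placeholder path to a trained model

# Raw (unnormalized) vectors: per the observation above, words with a single
# dominant sense often have larger norms than highly polysemous words.
for word in ["transistor", "cell"]:      # arbitrary example tokens
    vec = model.wv[word]                 # raw vector, magnitude preserved
    print(word, "norm:", np.linalg.norm(vec))

# Cosine similarity ignores those magnitudes entirely, comparing angles only.
a, b = model.wv["transistor"], model.wv["cell"]
print("cosine:", np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```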
A paper, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath (https://arxiv.org/abs/1702.01417v2), has noted that the average of all word-vectors that were trained together tends to be biased in a certain direction away from the origin, but that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own personal experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen, and that choosing the extreme (and uncommon) value of just 1 negative sample makes such a bias negligible (but might not be best for overall quality or efficiency/speed of training).
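For reference, the postprocessing that paper describes is small enough to sketch directly: subtract the common mean vector, then remove each vector's projection onto the top few principal components (the paper suggests roughly dimensions/100 components, so 1-2 for 150-dimensional vectors):

```python
import numpy as np

def all_but_the_top(vectors, n_components=None):
    """Mu, Bhat & Viswanath (2017) postprocessing: remove the common mean
    and the dominant principal components shared by all vectors."""
    if n_components is None:
        n_components = max(1, vectors.shape[1] // 100)   # paper's d/100 heuristic

    # 1. Remove the non-zero common mean (the bias away from the origin).
    centered = vectors - vectors.mean(axis=0)

    # 2. Top principal directions of the centered vectors (rows of vt).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]

    # 3. Subtract each vector's projection onto those directions.
    return centered - centered @ top.T @ top

# e.g. post = all_but_the_top(model.wv.vectors), if using gensim vectors
```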
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).