Hello, I'm trying to tune word2vec to find related categories in a large set of category lists.
My main problem compared to natural language is that my category lists are not ordered in a logical manner.
For example, I have lists of fruits:
[banana, mango, apple...],
[mango, lemon, pineapple...]
Let's assume mango usually comes in the same list as banana. I want the model to detect this relationship, such that when I call `most_similar` on 'mango' I'll get 'banana' first.
The problem is that the order of the fruits is meaningless: the distance between mango and banana within a list can vary without any meaning.
I thought of setting a very high window so "everything is related to everything", but I'm not sure it's the best approach.
I have a dataset of 12M sentences with 500K unique categories.
What is a good starting point for the alpha rate, window, and adjusting the model in general? Does word2vec even fit this?
Setting a very-large window – far larger than any of your texts – does essentially put all words into each other's context-windows. (If your texts are long, it will also significantly increase runtime, especially in skip-gram mode.) You can also use the optional non-default setting in recent Gensim releases, `shrink_windows=False`, to further ensure that a step which normally might probabilistically reduce effective window-sizes is skipped.
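A rough sketch of that configuration (the specific `window` value and other settings here are just illustrative placeholders, not recommendations):

```python
from gensim.models import Word2Vec

# your iterable of category lists, e.g. read from disk; tiny inline example here
sentences = [['banana', 'mango', 'apple'], ['mango', 'lemon', 'pineapple']]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,       # placeholder; tune for your data
    window=1000,           # far larger than your longest list, so every
                           # token falls in every other token's context
    shrink_windows=False,  # Gensim 4.1+: skip probabilistic window shrinking
    min_count=1,
    sg=1,                  # skip-gram; worth trying CBOW (sg=0) as well
    workers=4,
    epochs=5,
)

print(model.wv.most_similar('mango'))
```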
Whether word2vec will work well for you is something best answered empirically, by trying it versus other approaches on your data and needs. You're not quite using typical texts, with normal natural-language word frequencies & co-occurrences. But lots of people have found success with word2vec-like approaches on not-quite-real-language tokenized data.
While the default parameter values are often reasonable starting points for plentiful data and a typical task, you may have to vary them further from the defaults for optimal results when using other kinds of data, or pursuing less typical goals.
If using the default negative-sampling on categorical data, you may especially want to look at the optional parameter `ns_exponent` – frozen in early implementations at `ns_exponent=0.75`, but now adjustable in Gensim – which one paper suggested could have far better values in models for recommendation-systems. (See the class docs for that parameter for a link to that paper.)
But more generally: to find optimal values, you'll need some way to explore many options in an automated fashion. That means some robust, repeatable way to score your end model on your real intended end-task (or a close proxy), so that you can run it many different ways and then pick the one that scores best. Parameters like `epochs`, `vector_size`, `window`, `negative`, and `min_count` are those most often tuned.
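For example, a minimal grid-search sketch – where `my_category_lists` and the `score_model` function are hypothetical placeholders you'd replace with your own data iterable and end-task evaluation:

```python
import itertools
from gensim.models import Word2Vec

def score_model(model):
    """Hypothetical scorer: return a number that's higher when
    most_similar() results better match your known-good pairs."""
    raise NotImplementedError

param_grid = {
    'epochs': [5, 20],
    'vector_size': [64, 128],
    'window': [50, 1000],
    'negative': [5, 20],
    'min_count': [1, 5],
    'ns_exponent': [0.75, 0.0, -0.5],  # values far from 0.75 reportedly
                                       # help on recommendation-like data
}

best_score, best_params = float('-inf'), None
keys = list(param_grid)
for values in itertools.product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    model = Word2Vec(sentences=my_category_lists, workers=4,
                     shrink_windows=False, **params)
    score = score_model(model)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```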
Note though that if all you really want is the ranked list of things that most often come in the same list as a target query – get back 'banana' for 'mango', if and only if 'banana' co-appears most often – then you can use a much simpler approach than word2vec: just count, and retain, all the co-occurrences, or all the top-N co-occurrences for each unique key.
A fairly straightforward and usual way to do this in Python would be to use a `dict` with one entry per unique key (category), where the value stored at each key is an instance of the Python utility dictionary `Counter`. Iterate once over the whole dataset, tallying every co-occurrence.
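A minimal sketch of that tally (variable names are just illustrative):

```python
from collections import Counter, defaultdict
from itertools import combinations

# one Counter of co-occurring categories per category
co_counts = defaultdict(Counter)

for category_list in my_category_lists:  # your 12M lists
    unique_cats = set(category_list)     # ignore duplicates within a list
    for a, b in combinations(unique_cats, 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1
```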
At the end, when you look up the `Counter` for 'mango', just list its contents in highest-to-lowest count order. You'll have a precise answer, from a single pass over the data – rather than the 'dense' vectors that an algorithm like word2vec builds over many passes, probabilistically.
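Continuing the sketch above, the lookup is then just:

```python
# top 10 categories that most often co-appear with 'mango'
print(co_counts['mango'].most_common(10))
```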
(If, instead of sorted `Counter`s per key, you use sparse vectors – as wide as your count of unique keys, with the count of each co-occurrence in the respective slot per category – then you'll have a 'bag-of-words'-like large sparse vector per primary key. Those can also be pairwise compared by cosine-similarity, similar to how dense embeddings like smaller-dimensional word-vectors can be.)