I have a dataset of millions of arrays like the following:
sentences=[
[
'query_foo bar',
'split_query_foo',
'split_query_bar',
'sku_qwre',
'brand_A B C',
'split_brand_A',
'split_brand_B',
'split_brand_C',
'color_black',
'category_C1',
'product_group_clothing',
'silhouette_t_shirt_top',
],
[...]
]
where you find a query, a SKU that was acquired by the user doing the query, and a few attributes of the SKU. My idea was to build a very basic model based on word2vec where I could find similar things near each other. Put simply, if I search for t-shirt
in the model, I would expect t-shirt SKUs near the query.
I tried to use gensim (I'm new to this library) with different attributes to build a model:
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''
    def __init__(self):
        self.epoch = 0
        self.loss_to_be_subed = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
        self.epoch += 1

model = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=1000,
    min_count=2,
    workers=-1,
    epochs=10,
    # negative=5,
    compute_loss=True,
    callbacks=[callback()]
)
I got this output:
Loss after epoch 0: 0.0
Loss after epoch 1: 0.0
Loss after epoch 2: 0.0
Loss after epoch 3: 0.0
Loss after epoch 4: 0.0
Loss after epoch 5: 0.0
Loss after epoch 6: 0.0
Loss after epoch 7: 0.0
Loss after epoch 8: 0.0
Loss after epoch 9: 0.0
All losses of 0!!! I started to get very suspicious at this point.
Note: the elements of sentences
are independent; I hope the library doesn't try to mix terms across different arrays.
To test the model, I tried a very frequent query like model.wv.most_similar('query_t-shirt', topn=100)
and the results are completely absurd.
Is my idea crazy, or am I using the library incorrectly?
workers=-1
is not a valid parameter value: Gensim expects a positive count of worker threads, and a negative count results in no training happening at all (which is why every epoch reports a loss of 0). If some example somewhere suggested workers=-1, it's a bad example. If you got the impression that it would work from something in Gensim's official docs, please report that documentation as a bug to be fixed.
More generally: enabling logging at the INFO
level will show a lot more detail about what's happening, and something like "misguided parameter that prevents any training from happening" may become more obvious when using such logging.
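For instance, using only the standard library (force=True is my addition so the call takes effect even if logging was already configured):

```python
import logging

# Gensim logs vocabulary-building and per-epoch training progress at
# INFO level; with a misguided parameter like workers=-1 you would see
# it launch zero worker threads.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
    force=True,
)
logging.info('logging configured')
```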
Separately:
Gensim's Word2Vec
loss-tracking has a lot of open issues (including the failure to tally by epoch, which your Callback
tries to correct). I'd suggest not futzing with loss-display unless/until you've already achieved some success without it.
Such a low min_count=2
is usually a bad idea with the word2vec algorithm, at least in normal natural-language settings. Words with so few occurrences lack the variety of contrasting usage examples needed to achieve a generalizable word-vector, or, individually, to influence the model much compared to the far-more-numerous other words. But such rare words are, in aggregate, quite numerous, essentially serving as 'noise' that worsens other words. Discarding more such rare words often noticeably improves the remaining words, and the overall model. So, if you have enough raw training data to make word2vec worth applying, it is more often useful to raise this cutoff above the default min_count=5
than to reduce it.
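One way to pick a sensible cutoff is to inspect the token-frequency distribution first; a quick standard-library sketch (with a hypothetical tiny corpus in place of your real data):

```python
from collections import Counter

# Hypothetical miniature corpus; substitute your real `sentences`.
sentences = [
    ['query_foo', 'sku_qwre', 'brand_A', 'color_black'],
    ['query_foo', 'sku_zzzz', 'brand_A', 'color_red'],
    ['query_bar', 'sku_qwre', 'brand_B', 'color_black'],
]
freqs = Counter(tok for sent in sentences for tok in sent)

# How many distinct tokens survive each candidate min_count cutoff.
for cutoff in (1, 2, 3):
    kept = sum(1 for c in freqs.values() if c >= cutoff)
    print(cutoff, kept)
```

If a cutoff discards the bulk of your distinct tokens while keeping most of the total token occurrences, that's usually a sign the discarded tail was noise.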
For recommendation-like systems being fed by pseudotexts that aren't exactly natural-language-like, it may be especially worthwhile to experiment with the ns_exponent
parameter. As per the research paper linked in the class docs, the original ns_exponent=0.75
value, which was an unchangeable constant in early word2vec implementations, might not be ideal for other applications like recommender systems.