I am working on a food-similarity problem: given an Indian dish, I need to find similar Indian dishes. Can anyone help me approach this problem effectively? Is word2vec a reasonable choice for it?
For this task, I started by finding vectors for the ingredients, then taking a tf-idf weighted average of those ingredient vectors to get a vector per dish. I scraped ingredient lists for different dishes and applied word2vec, but I didn't find the results satisfactory.
from gensim.models import word2vec

# Setting values for NN parameters
num_features = 300    # Word vector dimensionality
min_word_count = 3    # Ignore ingredients appearing fewer times than this
num_workers = 4       # Number of CPUs
context = 10          # Context window size, i.e. average recipe size
downsampling = 1e-3   # Threshold for configuring which
                      # higher-frequency words are randomly downsampled

# Initializing and training the model
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# Using init_sims to make the model much more memory-efficient
model.init_sims(replace=True)

model.wv.most_similar('ginger')
Output:
[('salt', 0.9999704957008362),
('cloves garlic', 0.9999628067016602),
('garam masala', 0.9999610781669617),
('turmeric', 0.9999603033065796),
('onions', 0.999959409236908),
('vegetable oil', 0.9999580383300781),
('coriander', 0.9999570250511169),
('black pepper', 0.9999487400054932),
('cumin seeds', 0.999948263168335),
('green chile pepper', 0.9999480247497559)]
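For reference, the tf-idf weighted averaging step I describe could be sketched as follows. The tiny 2-d vectors and recipe lists here are toy stand-ins for trained word2vec vectors; the helper names are my own:

```python
import numpy as np
from collections import Counter
from math import log

# Toy 2-d ingredient vectors standing in for trained word2vec vectors
ingredient_vecs = {
    "ginger": np.array([1.0, 0.0]),
    "salt": np.array([0.0, 1.0]),
    "turmeric": np.array([1.0, 1.0]),
}

recipes = {
    "dish_a": ["ginger", "salt"],
    "dish_b": ["turmeric", "salt"],
}

def idf(ingredient, recipes):
    # Standard idf: rare ingredients get higher weight
    n_containing = sum(ingredient in ings for ings in recipes.values())
    return log(len(recipes) / n_containing)

def dish_vector(ingredients, recipes):
    # tf-idf weighted average of the ingredient vectors
    counts = Counter(ingredients)
    total = sum(counts.values())
    vec = np.zeros(2)
    weight_sum = 0.0
    for ing, count in counts.items():
        w = (count / total) * idf(ing, recipes)
        vec += w * ingredient_vecs[ing]
        weight_sum += w
    return vec / weight_sum if weight_sum else vec
```

Note that an ingredient appearing in every recipe (here, "salt") gets idf 0 and drops out of the average, which is the intended effect of the weighting.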
Word2vec may be reasonable for this task. You might need more data, or parameter tweaks, to get the best results.
It's not obvious to me what's wrong with your example results, so you should add more detail to your question: more examples, and an explanation of why you're unsatisfied with the results.
If you have certain ideal-results, that you can collect into repeatable model tests, that would help you to tune the model. For example, if you know that "cinnamon" should be a better match for "ginger" than "salt", you would encode that (and dozens or hundreds or thousands of other "preferred answers") into an automated evaluation method that could score a model.
Then, you could adjust ("meta-optimize") model parameters to find a model that scores best on your evaluation.
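Such an evaluation harness could be as simple as the sketch below. The triple format and the `score_model` name are my own invention; in practice `similarity` might wrap `model.wv.similarity`:

```python
def score_model(similarity, triples):
    """Score a model against preferred-answer triples.

    similarity(a, b) -> float, higher meaning more similar.
    Each triple (query, better, worse) earns a point when the model
    rates `better` as more similar to `query` than `worse` is.
    Returns the fraction of triples the model gets right.
    """
    correct = sum(similarity(q, better) > similarity(q, worse)
                  for q, better, worse in triples)
    return correct / len(triples)
```

For example, the "cinnamon should beat salt for ginger" judgment would become the triple `("ginger", "cinnamon", "salt")`, and a model scoring near 1.0 over many such triples matches your preferences well.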
Parameters worth experimenting with include:
- A different size (vector dimensionality) could help, depending on the richness of your data
- A different window might help
- More epochs might help
- A higher min_count (discarding more low-frequency words) often helps, especially with larger datasets
- A more aggressive sample value (smaller, such as 1e-04 or 1e-05) can help, with very large datasets
- Non-default ns_exponent values may help, especially for recommendation-applications
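Tying that together, a parameter sweep might be sketched like this. The grid values here are only illustrative, and `train_and_score` is a placeholder callback you'd implement yourself (train a gensim Word2Vec with the given parameters, then return your evaluation score):

```python
from itertools import product

# Candidate values for the parameters discussed above (illustrative only)
param_grid = {
    "size": [100, 200, 300],
    "window": [5, 10],
    "epochs": [5, 20],
    "min_count": [3, 5],
    "sample": [1e-3, 1e-4, 1e-5],
    "ns_exponent": [0.75, 0.0, -0.5],
}

def grid_search(train_and_score, grid):
    """Exhaustively try every parameter combination.

    train_and_score(**params) -> evaluation score, higher is better.
    Returns the best-scoring parameter dict and its score.
    """
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

An exhaustive sweep over this grid trains 360 models, so with real training you'd likely trim the grid or use randomized search instead.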