Search code examples
nlpdata-scienceword2vec

Cluster similar words using word2vec


I have various restaurant labels with me and i have some words that are unrelated to restaurants as well. like below:

vegan vegetarian pizza burger transportation coffee Bookstores Oil and Lube

I have such mix of around 500 labels. I want to know is there a way pick the similar labels that are related to food choices and leave out words like oil and lube, transportation.

I tried using word2vec but, some of them have more than one word and could not figure out a right way.

Brute-force approach is to tag them manually. But, i want to know is there a way using NLP or Word2Vec to cluster all related labels together.


Solution

  • Word2Vec could help with this, but key factors to consider are:

    • How are your word-vectors trained? Using off-the-shelf vectors (like say the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or include multi-word tokens like 'oil_and_lube'. But, if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.

    • The similarity of word-vectors isn't strictly 'synonymity' but often other forms of close-relation including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity-values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out & tinker around. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common-in-the-same-topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find tuning Word2Vec training parameters improve the resulting vectors for your specific needs.)

    Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?