text, nlp, word2vec

When to use Word2vec and bag of words?


I'm still unsure when to use word2vec and when to rely on bag of words. For example, if I want to build a text clustering model that takes text as input and outputs a cluster for each input, should I care about the word representation and use word2vec, or should I rely on bag of words and treat each input text as a document? Please share any further reading with me; I'm very interested in text preprocessing and clustering and want to learn as much as I can about them.

Furthermore, if I want to use k-means for the clustering, should I split the data, or is it okay to work with the whole dataset at once?


Solution

  • There are no hard rules. Generally, for any set of techniques you consider plausibly appropriate, & within your skills/budget, you try them all against your specific data & task, and pick the ones that perform better.

    (You might develop some vague intuitions over time about situations where certain approaches are more likely to reflect the 'essential' parts of your task - but they can hardly be communicated in a StackOverflow answer that covers all possibilities.)

    If you've tried specific things & been surprised or disappointed by the result, that might create a more-answerable question, where you supply the specifics of your data/task, & what you've tried, & what your results are, and ask about specific unexpected behaviors, or specific aspects you'd want corrected/improved.
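
One way to act on the "try them all" advice is a small comparison harness. The sketch below is only an illustration under stated assumptions: the documents and the two-dimensional `TOY_VECTORS` table are made up (in practice the vectors would come from a trained word2vec model), the helper names are hypothetical, and the silhouette score is just one possible stand-in for whatever measure actually matters for your task.

```python
import math
import random
from collections import Counter


def bag_of_words(docs):
    """Represent each document as word counts over the shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([float(counts[w]) for w in vocab])
    return rows


def averaged_vectors(docs, word_vectors):
    """Represent each document as the mean of its known word vectors."""
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for d in docs:
        known = [word_vectors[w] for w in d.split() if w in word_vectors]
        if known:
            rows.append([sum(col) / len(known) for col in zip(*known)])
        else:
            rows.append([0.0] * dim)  # no known words: fall back to zeros
    return rows


def kmeans(points, k, seed=0, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient, in [-1, 1]; higher is better separated."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist.get(labels[i])
        if a is None:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def compare(representations, k):
    """Cluster under each representation; return {name: (score, labels)}."""
    results = {}
    for name, vectors in representations.items():
        labels = kmeans(vectors, k)
        results[name] = (silhouette(vectors, labels), labels)
    return results


# Made-up documents and hand-written stand-ins for trained word vectors.
docs = [
    "the cat chased the kitten",
    "a dog and a cat play",
    "stocks fell as the market closed",
    "investors sold stocks in the market",
]
TOY_VECTORS = {
    "cat": [0.9, 0.1], "kitten": [0.8, 0.2], "dog": [1.0, 0.0],
    "stocks": [0.1, 0.9], "market": [0.0, 1.0], "investors": [0.2, 0.8],
}

results = compare(
    {
        "bag of words": bag_of_words(docs),
        "averaged word vectors": averaged_vectors(docs, TOY_VECTORS),
    },
    k=2,
)
for name, (score, labels) in results.items():
    print(f"{name}: silhouette={score:.3f} labels={labels}")
```

On real data you would swap in your own documents and vectors, and ideally judge the clusters by inspecting them or by a task-specific measure, since a single internal score computed in two different feature spaces is only a rough guide.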