I am doing text classification using scikit-learn following the example in the documentation.
In order to extract features, that is, to convert the text into a set of vectors, the example uses a HashingVectorizer and a TfidfVectorizer.
I apply stemming before the vectorizer in order to handle different inflected forms of the same word. That is, I would like "running" and "run" to be mapped to the same vector.
I wonder if there is an advantage in using a word2vec model as the vectorizer instead. I thought that this would allow me to handle synonyms, that is, to map different words that have the same meaning to vectors that are very close to each other in the vector space.
Is my reasoning correct, or will the KMeans clustering algorithm that follows handle synonyms for me?
Yes, word2vec-based features sometimes offer an advantage.
But whether & how it can help will depend on your exact data/goals, and the baseline results you've achieved before trying word2vec-enhanced approaches. And those aren't described or shown in your question.
The scikit-learn example you cite as your model doesn't integrate any word2vec features. What happens if you add such features? (As one very clumsy but simple example, what if you either replace the HashingVectorizer features with, or concatenate onto them, a vector that's the average of all a text's word-vectors?)
Do the results improve, by either some quantitative score or a rough eyeballed review?
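To make the "average of all a text's word-vectors" idea concrete, here is a minimal sketch. The word_vectors dictionary, its 3-dimensional values, the average_vector helper and the sample texts are all made up for illustration; in a real setup the vectors would come from a trained word2vec model, which isn't shown here. The n_features value is also an arbitrary choice.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy word vectors (assumption): stand-ins for the output of a
# trained word2vec model. Synonyms are given nearby vectors on purpose.
word_vectors = {
    "car": np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.85, 0.15, 0.0]),
    "banana": np.array([0.0, 0.2, 0.9]),
    "fruit": np.array([0.1, 0.1, 0.8]),
}
dim = 3  # dimensionality of the toy vectors


def average_vector(text):
    """Mean of the vectors of all known words in text; zeros if none are known."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


texts = ["car", "automobile", "banana fruit"]

# Sparse bag-of-words style features, shape (3, 1024).
hashed = HashingVectorizer(n_features=2**10).transform(texts)

# Dense averaged word-vector features, shape (3, 3).
averaged = np.vstack([average_vector(t) for t in texts])

# Option 1: use `averaged` in place of `hashed`.
# Option 2: concatenate both blocks column-wise, shape (3, 1027).
combined = hstack([hashed, csr_matrix(averaged)]).tocsr()
```

Either averaged or combined can then be passed to KMeans in place of the original features. The two blocks may be on different scales, so how they are weighted relative to each other is something you'd likely need to experiment with.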