machine-learning · scikit-learn · word2vec · text-classification · word-embedding

Use word2vec word embedding as feature vector for text classification (similar to count vectorizer/tfidf feature vector)


I am trying to perform text classification using machine learning. For that, I have extracted feature vectors from the pre-processed text using a simple bag-of-words approach (CountVectorizer) and a TF-IDF vectorizer.

Now I want to use word2vec word embeddings as my feature vectors, in the same way as the count vectorizer/tfidf vectorizer: I should be able to learn the vocabulary from the training data and then transform the test data with the learned vocabulary. I can't find a way to implement that.

# I need something like this with word2vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

count = CountVectorizer()
train_feature_vector = count.fit_transform(train_data)  # learn vocabulary and vectorize train set
test_feature_vector = count.transform(test_data)        # reuse learned vocabulary on test set

# So I can train my model like this
mb = MultinomialNB()
mb.fit(train_feature_vector, y_train)
acc_score = mb.score(test_feature_vector, y_test)
print("Accuracy " + str(acc_score))

Solution

  • You should first understand what word embeddings are. When you apply a CountVectorizer or TfidfVectorizer, what you get is a sparse sentence representation, essentially a one-hot (count-based) encoding over the vocabulary. Word embeddings instead represent each word as a dense vector of real numbers in a continuous space.

    Once you have a per-word representation, there are several ways to turn it into a per-sentence feature vector (one common option, averaging, is sketched below); see: How to get vector for a sentence from the word2vec of tokens in sentence
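
A common option from that linked question is to average the word vectors of each document's tokens. Below is a minimal sketch of that idea wrapped in a CountVectorizer-style fit/transform interface. It assumes gensim >= 4.0 (where Word2Vec takes vector_size) and plain whitespace tokenization; MeanWord2Vectorizer is just an illustrative name, not a library class.

# Minimal mean-word2vec vectorizer (sketch, assuming gensim >= 4.0)
import numpy as np
from gensim.models import Word2Vec

class MeanWord2Vectorizer:
    def __init__(self, vector_size=100, min_count=1):
        self.vector_size = vector_size
        self.min_count = min_count
        self.model = None

    def fit(self, raw_documents):
        # Learn word vectors from the training corpus only.
        tokenized = [doc.split() for doc in raw_documents]
        self.model = Word2Vec(tokenized, vector_size=self.vector_size,
                              min_count=self.min_count)
        return self

    def transform(self, raw_documents):
        # Represent each document as the mean of its in-vocabulary word vectors;
        # tokens unseen during fit (e.g. test-only words) are skipped.
        rows = []
        for doc in raw_documents:
            vecs = [self.model.wv[t] for t in doc.split() if t in self.model.wv]
            rows.append(np.mean(vecs, axis=0) if vecs
                        else np.zeros(self.vector_size))
        return np.vstack(rows)

    def fit_transform(self, raw_documents):
        return self.fit(raw_documents).transform(raw_documents)

One caveat for your training snippet: MultinomialNB expects non-negative count features and will reject the real-valued (possibly negative) embedding vectors, so pair this with a classifier such as LogisticRegression or GaussianNB instead:

from sklearn.linear_model import LogisticRegression

w2v = MeanWord2Vectorizer()
train_feature_vector = w2v.fit_transform(train_data)
test_feature_vector = w2v.transform(test_data)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_feature_vector, y_train)
print("Accuracy " + str(clf.score(test_feature_vector, y_test)))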