I am attempting to use gensim's word2vec to transform a column of a pandas dataframe into a vector that I can pass to a sklearn classifier to make a prediction.
I understand that I need to average the vectors for each row. I have tried following this guide but I am stuck, as I am getting models back but I don't think I can access the underlying embeddings to find the averages.
Please see a minimal, reproducible example below:
import pandas as pd, numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer
temp_df = pd.DataFrame.from_dict({'ID': [1,2,3,4,5], 'ContData': [np.random.randint(1, 10 + 1)]*5,
'Text': ['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit.', 'Sed elementum ultricies varius.',
'Nunc vel risus sed ligula ultrices maximus id qui', 'Pellentesque pellentesque sodales purus,'],
'Class': [1,0,1,0,1]})
temp_df['text_lists'] = [x.split(' ') for x in temp_df['Text']]
w2v_model = Word2Vec(temp_df['text_lists'].values, min_count=1)
cv = CountVectorizer()
count_model = pd.DataFrame(data=cv.fit_transform(temp_df['Text']).todense(), columns=list(cv.get_feature_names_out()))
Using sklearn's CountVectorizer, I am able to get a simple frequency representation that I can pass to a classifier. How can I get that same format using Word2vec?
This toy example produces:
adipiscing amet consectetur dolor elementum elit id ipsum ligula lorem ... purus qui risus sed sit sodales ultrices ultricies varius vel
0 0 1 0 1 0 0 0 1 0 1 ... 0 0 0 0 1 0 0 0 0 0
1 1 0 1 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 1 0
3 0 0 0 0 0 0 1 0 1 0 ... 0 1 1 1 0 0 1 0 0 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 0
While this runs without error, I cannot access the underlying embeddings in a format that I can pass to a classifier. I would like to produce the same format as above, except that instead of counts, the values are the word2vec embeddings.
While you might not be able to help it if your original data comes from a Pandas DataFrame, neither Gensim nor Scikit-Learn works with DataFrame-style data natively. Rather, they tend to use raw numpy arrays, or base Python data structures like lists or iterable sequences.
Trying to shoehorn interim raw vectors into the Pandas style of data structure tends to add code complication & wasteful overhead.
That's especially true if the vectors are dense vectors, where essentially all of a smaller number of dimensions are nonzero, as in word2vec-like algorithms. But it's also true of the kinds of sparse vectors, with a giant number of dimensions but most dimensions 0, that come from CountVectorizer and various "bag-of-words"-style text models.
So first, I'd recommend against putting the raw outputs of Word2Vec or CountVectorizer, which are usually interim representations on the way to completing some other task, into a DataFrame.
If you want to have the final assigned labels in the DataFrame, for analysis or reporting in the Pandas style, only add those final outputs at the end. But to understand the interim vector representations, and then to pass them to things like Scikit-Learn classifiers in the formats those classes expect, keep those vectors (and inspect them yourself for clarity) in their raw numpy vector formats.
In particular, after Word2Vec runs (with the parameters you've shown), there'll be a 100-dimensional vector per word, not per multi-word text. And the 100 dimensions have no names other than their indexes 0 to 99.
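For example (a small sketch, assuming the w2v_model and min_count=1 setup from your snippet, so every word received a vector), you can look up any single word's vector directly:

# Each trained word has its own 100-dimensional vector (the default vector_size=100).
# Note the keys are the tokens exactly as trained - no lowercasing was applied here.
vec = w2v_model.wv['ipsum']
print(vec.shape)                       # (100,)
print(w2v_model.wv.index_to_key[:10])  # some of the words that received vectors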
And unlike the dimensions of the CountVectorizer representation, which are counts of individual words, each dimension of the "dense embedding" will be some floating-point decimal value that has no clear or specific interpretation alone: it's only directions/neighborhoods in the whole space, shearing across many dimensions, that vaguely correspond with useful or human-nameable concepts.
If you want to turn the per-word 100-dimensional vectors into vectors for a multi-word text, there are many potential ways to do so – but one simple choice is to simply average together the N word-vectors into 1 summary vector. Gensim's class holding the word-vectors inside the Word2Vec model, KeyedVectors, has a .get_mean_vector() method that can help. For example:
texts_as_wordlists = [x.split(' ') for x in temp_df['Text']]
text_vectors = [w2v_model.wv.get_mean_vector(wordlist) for wordlist in texts_as_wordlists]
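From there, a rough sketch of the next step (LogisticRegression is just an illustrative choice of Scikit-Learn estimator, not a recommendation) is to stack those per-text vectors into a plain 2-D numpy array and fit the classifier on that:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.vstack(text_vectors)             # shape: (number_of_texts, 100)
y = temp_df['Class'].values
clf = LogisticRegression().fit(X, y)    # toy-sized fit, just to show the plumbing
preds = clf.predict(X)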
There are many other potential ways to use word-vectors to model a longer text. For example, you might reweight the words before averaging. But a simple average is a reasonable first baseline approach. (Other algorithms related to word2vec, like the 'Paragraph Vector' algorithm implemented by the Doc2Vec class, can also create a vector for a multi-word text, and such a vector is not just the average of its word-vectors; a small sketch follows below.)
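A minimal sketch of that Doc2Vec alternative, with illustrative parameters and reusing the TaggedDocument import already in your example (min_count=1 here only so the tiny toy corpus yields any vectors at all; see the caveat on min_count below):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_docs = [TaggedDocument(words, [i]) for i, words in enumerate(texts_as_wordlists)]
d2v_model = Doc2Vec(tagged_docs, vector_size=100, min_count=1, epochs=40)
doc_vectors = [d2v_model.dv[i] for i in range(len(tagged_docs))]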
Two other notes on using Word2Vec:
min_count=1 is essentially always a bad idea with this algorithm. Related to the point above, the algorithm needs multiple subtly-contrasting usage examples of any word to have any chance of placing it meaningfully in the shared coordinate space. Words with just one, or even a few, usages tend to get awful vectors that don't generalize to the word's real meaning as would be evident from a larger sample of its use. And, in natural-language corpora, such few-example words are very numerous - so they wind up taking a lot of the training time, and achieving their bad representations actually worsens the vectors for surrounding words, which could otherwise be better, since they do have enough training examples. So, the best practice with word2vec is usually to ignore the rarest words – train as if they weren't even there. (The class's default is min_count=5 for good reasons, and if that results in your model missing vectors for words you think you need, get more data showing uses of those words in real contexts, rather than lowering min_count.)
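As a quick way to see what min_count is doing (a small sketch, reusing the w2v_model from above), you can check which words survived the frequency cutoff:

print(len(w2v_model.wv))            # size of the surviving vocabulary
print(w2v_model.wv.index_to_key)    # with min_count=1 every word survives; raising it drops
                                    # rarer words (on this tiny toy corpus, min_count=5 would
                                    # leave nothing to train on)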