python, scikit-learn, nlp, gensim, word2vec

Converting word2vec output into dataframe for sklearn


I am attempting to use gensim's word2vec to transform a column of a pandas dataframe into a vector that I can pass to a sklearn classifier to make a prediction.

I understand that I need to average the vectors for each row. I have tried following this guide but I am stuck, as I am getting models back but I don't think I can access the underlying embeddings to find the averages.

Please see a minimal, reproducible example below:

import pandas as pd, numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer

temp_df = pd.DataFrame.from_dict({'ID': [1,2,3,4,5], 'ContData': [np.random.randint(1, 10 + 1)]*5, 
                                'Text': ['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit.', 'Sed elementum ultricies varius.',
                                         'Nunc vel risus sed ligula ultrices maximus id qui', 'Pellentesque pellentesque sodales purus,'],
                                'Class': [1,0,1,0,1]})
temp_df['text_lists'] = [x.split(' ') for x in temp_df['Text']]

w2v_model = Word2Vec(temp_df['text_lists'].values, min_count=1)

cv = CountVectorizer()
count_model = pd.DataFrame(data=cv.fit_transform(temp_df['Text']).todense(), columns=list(cv.get_feature_names_out()))

Using sklearn's CountVectorizer, I am able to get a simple frequency representation that I can pass to a classifier. How can I get that same format using Word2vec?

This toy example produces:

adipiscing  amet    consectetur dolor   elementum   elit    id  ipsum   ligula  lorem   ... purus   qui risus   sed sit sodales ultrices    ultricies   varius  vel
0   0   1   0   1   0   0   0   1   0   1   ... 0   0   0   0   1   0   0   0   0   0
1   1   0   1   0   0   1   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   1   0   0   0   0   0   ... 0   0   0   1   0   0   0   1   1   0
3   0   0   0   0   0   0   1   0   1   0   ... 0   1   1   1   0   0   1   0   0   1
4   0   0   0   0   0   0   0   0   0   0   ... 1   0   0   0   0   1   0   0   0   0

While this runs without error, I cannot access the underlying embeddings in a format I can pass to a classifier. I would like to produce the same format as above, except with the word2vec embedding values instead of counts.


Solution

  • While you might not be able to help it if your original data comes from a Pandas DataFrame, neither Gensim nor Scikit-Learn works with DataFrame-style data natively. Rather, they tend to use raw numpy arrays, or base Python data structures like lists or iterable sequences.

    Trying to shoehorn interim raw vectors into the Pandas style of data structure tends to add code complication & wasteful overhead.

    That's especially true if the vectors are dense vectors, where essentially all of a smaller number of dimensions are nonzero, as in word2vec-like algorithms. But it's also true of the kinds of sparse vectors, with a giant number of dimensions but most dimensions 0, that come from CountVectorizer and various "bag-of-words"-style text models.

    So first, I'd recommend against putting the raw outputs of Word2Vec or CountVectorizer, which are usually interim representations on the way to completing some other task, into a DataFrame.

    If you want to have the final assigned labels in the DataFrame, for analysis or reporting in the Pandas style, only add those final outputs at the end. But to understand the interim vector representations, and then to pass them to things like Scikit-Learn classifiers in the formats those classes expect, keep those vectors (and inspect them yourself for clarity) in their raw numpy vector formats.

    In particular, after Word2Vec runs (with the parameters you've shown), there'll be a 100-dimensional vector per word. Not per multi-word text. And the 100 dimensions have no names other than their indexes 0 to 99.

    And unlike the dimensions of the CountVectorizer representation, which are counts of individual words, each dimension of the "dense embedding" will be some floating-point decimal value that has no clear or specific interpretation alone: it's only directions/neighborhoods in the whole space, shearing across many dimensions, that vaguely correspond with useful or human-nameable concepts.
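
    A quick way to see that, reusing the w2v_model trained from the question's code (gensim's default vector_size is 100):

    # each *word* gets one dense float vector; its dimensions are unnamed, just indexes 0..99
    vec = w2v_model.wv['ipsum']
    print(vec.shape)                        # (100,)
    print(len(w2v_model.wv.index_to_key))   # how many words received vectors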

    If you want to turn the per-word 100-dimensional vectors into vectors for a multi-word text, there are many potential ways to do so – but one simple choice is to average the N word-vectors together into 1 summary vector. The Gensim class that holds the word-vectors inside the Word2Vec model, KeyedVectors, has a .get_mean_vector() method that can help. For example:

    texts_as_wordlists = [x.split(' ') for x in temp_df['Text']]
    text_vectors = [w2v_model.wv.get_mean_vector(wordlist) for wordlist in texts_as_wordlists]
    

    There are many other potential ways to use word-vectors to model a longer text. For example, you might reweight the words before averaging. But a simple average is a reasonable first baseline approach. (Other algorithms related to word2vec, like the 'Paragraph Vector' algorithm implemented by the Doc2Vec class, can also create a vector for a multi-word text, and such a vector is not just the average of its word-vectors.)
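
    Once you have one mean vector per text, the format Scikit-Learn expects is just a 2-D numpy array of shape (n_texts, vector_size), plus a parallel array of labels. A minimal sketch, reusing text_vectors and temp_df from above, with LogisticRegression purely as a stand-in for whatever classifier you actually intend to use:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.vstack(text_vectors)       # shape (5, 100): one averaged vector per text
    y = temp_df['Class'].values       # labels straight from the original DataFrame

    clf = LogisticRegression().fit(X, y)

    # if you want results back in Pandas, add only these *final* outputs to the DataFrame
    temp_df['Predicted'] = clf.predict(X)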

    Two other notes on using Word2Vec:

    • word2vec vectors only get good when trained on lots of word-usage data. Toy-sized examples trained on only hundreds, or even tens-of-thousands, of words rarely show anything useful, or anything resembling the power of this algorithm on larger datasets.
    • min_count=1 is essentially always a bad idea with this algorithm. Related to the point above, the algorithm needs multiple subtly-contrasting usage examples of any word to have any chance of placing it meaningfully in the shared coordinate space. Words with just one, or even a few, usages tend to get awful vectors that don't generalize to the word's real meaning as would be evident from a larger sample of its use. And, in natural-language corpora, such few-example words are very numerous, so they wind up taking a lot of the training time, and achieving their bad representations actually worsens the vectors for surrounding words, which could otherwise be better because they do have enough training examples. So, the best practice with word2vec is usually to ignore the rarest words: train as if they weren't even there. (The class's default is min_count=5 for good reasons, and if that leaves your model missing vectors for words you think you need, get more data showing uses of those words in real contexts, rather than lowering min_count.) The quick check below shows the vocabulary-pruning effect in action.
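
    A tiny synthetic check of that pruning behavior (the corpus here is far too small to yield useful vectors; it exists only to show which words survive each min_count setting):

    from gensim.models import Word2Vec

    corpus = [['the', 'cat', 'sat', 'on', 'the', 'mat']] * 10 + [['a', 'platypus', 'appeared', 'once']]

    keep_all  = Word2Vec(corpus, min_count=1)   # keeps every word, even one-off ones
    keep_freq = Word2Vec(corpus, min_count=5)   # the default: drops words seen fewer than 5 times

    print(len(keep_all.wv), len(keep_freq.wv))  # 9 vs 5 surviving words
    print('platypus' in keep_freq.wv)           # False -- the rare word was ignored entirely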