I have a word2vec model, abuse_model, trained with Gensim. I want to apply PCA and plot only CERTAIN words that I care about (rather than all words in the model). So I created a dict, d, whose keys are the words I care about and whose values are the corresponding vectors.
vocab = list(abuse_model.wv.key_to_index)
vocab = [v for v in vocab if v in positive_terms]
d = {}
for word in vocab:
    d[word] = abuse_model.wv[word]
No errors so far.
I encountered an error when passing the dict into pca.fit_transform. I'm new to this and am wondering whether the data format I passed in (a list of tuples) is incorrect. What data type does the argument have to be?
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
result = pca.fit_transform(list(d.items()))
Thanks in advance!
Per the scikit-learn docs (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform), the argument to .fit_transform(), as is usual for scikit-learn models, is "array-like of shape (n_samples, n_features)".
Here, that'd mean your samples/rows are words, and your features/columns are the word-vector dimensions. And, you'll want to remember, outside of the PCA object, which words correspond to which rows. (In Python 3.7+, the fact that your d dict will always iterate in insertion order should have you covered there.)
So, it may be enough to change your use of .items() to .values(). The .items() call yields (word, vector) tuples, which can't be treated as a numeric 2-D array, whereas list(d.values()) supplies PCA with a list of vectors, which is suitably array-like.
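As a rough sketch of that change, the snippet below builds a stand-in for d and passes only its values to PCA. The four words and the random 50-dimensional vectors are made up for illustration; in the real case the vectors would come from abuse_model.wv.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the question's `d` (word -> vector). The words and the
# random 50-dimensional vectors are invented for this example only.
rng = np.random.default_rng(0)
d = {word: rng.normal(size=50) for word in ["good", "great", "nice", "kind"]}

words = list(d)                  # row labels, in dict insertion order
X = np.array(list(d.values()))   # shape (n_samples, n_features)

pca = PCA(n_components=2)
result = pca.fit_transform(X)    # one 2-D point per word

# Row i of `result` corresponds to words[i].
coords = dict(zip(words, result))
```

Keeping words alongside result is what lets you label each point later, since PCA itself only sees the numeric rows.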
A few other notes:
- The .index_to_key property is already a list (.key_to_index is the dict that maps each word to its index), so you can iterate it directly and don't need to convert/copy it.
- If positive_terms is a large list, changing it to a set could offer faster in membership-testing.
- Rather than building your d dict, which involves a little more overhead (including when you then make a list of its values), if your sets of words and vectors are large, you might want to preallocate a numpy array of the right size and collect your vectors in it. For example:
import numpy as np
X = np.empty((len(vocab), abuse_model.wv.vector_size))
for i, word in enumerate(vocab):
    X[i] = abuse_model.wv[word]
#...
#...
result = pca.fit_transform(X)
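The set-based filtering mentioned in the notes could look something like the sketch below. The two lists are hypothetical stand-ins: in the question they would be abuse_model.wv.index_to_key (the list of words in Gensim 4.x) and the asker's positive_terms.

```python
# Hypothetical stand-ins for abuse_model.wv.index_to_key and positive_terms.
index_to_key = ["the", "good", "bad", "great", "nice"]
positive_terms = ["nice", "good", "great", "excellent"]

# One-time conversion; `in` on a set is O(1) on average,
# versus a linear scan for a list.
positive_set = set(positive_terms)

# Iterating the model's list keeps the model's word order; terms that are
# not in the model (here "excellent") simply never appear.
vocab = [w for w in index_to_key if w in positive_set]
```

The resulting vocab list fixes the row order, so it can serve both to fill X and to label the PCA output afterwards.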