What is the data type of X in pca.fit_transform(X)?

I have a word2vec model, abuse_model, trained with Gensim. I want to apply PCA and plot only CERTAIN words that I care about (vs. all words in the model). So I created a dict d whose keys are the words I care about and whose values are the corresponding vectors.

vocab = list(abuse_model.wv.key_to_index)
vocab = [v for v in vocab if v in positive_terms]
d = {}
for word in vocab:
    d[word] = abuse_model.wv[word]

No errors so far.

I encountered an error when passing the dict's items into pca.fit_transform. I'm new to this and am wondering whether the data format I passed in (a list of tuples) is incorrect. What data type does the argument have to be?

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(list(d.items()))

Thanks in advance!

Solution

Per scikit-learn docs – https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform – the argument to .fit_transform(), as is usual for scikit-learn models, is "array-like of shape (n_samples, n_features)".

Here, that means your samples/rows are words, and the features/columns are the word-vector dimensions. You'll also want to keep track, outside of the PCA object, of which words correspond to which rows. (In Python 3.7+, a dict is guaranteed to iterate in insertion order, so your d dict should have you covered there.)

So, it may be enough to change your use of .items() to .values(), so that you wind up supplying PCA with your list (which is suitably array-like) of vectors.
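As a minimal sketch of that change: the dict below is filled with random stand-in vectors (not real embeddings) so the snippet runs without a trained model, and the words and the vector size of 50 are made up for illustration. With your real d, only the last few lines would apply.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the d dict from the question: word -> vector.
# Random data is used here only so the example is self-contained.
rng = np.random.default_rng(0)
d = {word: rng.normal(size=50) for word in ["good", "great", "kind", "nice", "happy"]}

words = list(d)                  # row labels, in insertion order
X = np.array(list(d.values()))   # shape (n_samples, n_features)

pca = PCA(n_components=2)
result = pca.fit_transform(X)    # shape (n_samples, 2)

# Row i of result belongs to words[i], so the two can be zipped back together.
coords = dict(zip(words, result))
```

Keeping words alongside result is what lets you label the points later when plotting.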

A few other notes:

the .key_to_index property is a dict keyed by word, so you can iterate over it (or test membership in it) directly without first copying it into a list; if you do want a list of words, .index_to_key already is one
if your positive_terms is a large list, changing it to a set could make the membership testing faster
rather than using a d dict, which involves a little more overhead (including when you then make a list of its values), you might want to preallocate a numpy array of the right size and collect your vectors in it, if your sets of words and vectors are large. For example:

import numpy as np

X = np.empty((len(vocab), abuse_model.wv.vector_size))
for i, word in enumerate(vocab):
    X[i] = abuse_model.wv[word]

#...
#...

result = pca.fit_transform(X)

Even though your hunch is that you only want the dimensionality reduction on your subset of words, you may also want to try keeping all words, or adding some random subset of other words. That might help retain some of the original structure that your subsampling would otherwise have prematurely removed. (I'm unsure of this; just noting it could be a factor.) Even if you do the PCA on a larger set of words, you could still choose to plot/analyze only your desired subset afterwards, for clarity.
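A rough sketch of that fit-on-everything, plot-a-subset idea follows. The arrays here are random stand-ins so the snippet runs on its own; with a Gensim 4 model, all_vectors would presumably be abuse_model.wv.vectors and key_to_index would be abuse_model.wv.key_to_index. The word names and sizes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for the full model vocabulary and its vectors (random, illustrative only).
rng = np.random.default_rng(0)
all_words = [f"word{i}" for i in range(200)]
all_vectors = rng.normal(size=(200, 50))
key_to_index = {w: i for i, w in enumerate(all_words)}

# The subset you actually want to plot (hypothetical words).
wanted = ["word3", "word17", "word42"]
rows = [key_to_index[w] for w in wanted]

# Fit the projection on every word, then project only the words of interest.
pca = PCA(n_components=2)
pca.fit(all_vectors)
subset_2d = pca.transform(all_vectors[rows])   # shape (len(wanted), 2)
```

The axes are then determined by the whole vocabulary rather than by the handful of selected words, which may or may not give a more useful picture for your data.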