I am using PCA to reduce documents to 2 dimensions so I can visualise them. My method looks like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(sent_list).todense()
pca = PCA(n_components=2)
data2D = pca.fit_transform(X)
Then I am plotting them using matplotlib:

plt.scatter(data2D[:, 0], data2D[:, 1], c=label_colour)
I want to add 2 new points and see where they fall in the model. To date, I have appended the new points to the end of the training data and plotted an X over the last two positions in the array, but I am not sure whether this is a true reflection of their values. Any insight would be great.
Both TfidfVectorizer and PCA retain the order of rows after the transformation, so what you are doing is essentially correct (i.e. the last rows in sent_list are mapped to the last rows in the data2D array).
However, if the new data points should not affect the model, you should first fit the model with the original data and then transform the new data with the already fitted model. For example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Fit the model with the original data
vect = TfidfVectorizer()
X = vect.fit_transform(sent_list)
svd = TruncatedSVD(n_components=2)
data2D = svd.fit_transform(X)

# Transform new data with the already-fitted model
X_new = vect.transform(new_data)
data2D_new = svd.transform(X_new)
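To make this concrete, here is a self-contained sketch of the whole workflow; the sentences in sent_list and new_data are placeholder examples of my own, not from your data, and the X markers show where the two new documents land in the fitted 2-D space:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in a script
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpora standing in for your sent_list / new_data
sent_list = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell on Monday",
    "the market rallied after the report",
]
new_data = ["a cat chased a dog", "prices rose on Tuesday"]

# Fit on the original documents only
vect = TfidfVectorizer()
X = vect.fit_transform(sent_list)
svd = TruncatedSVD(n_components=2)
data2D = svd.fit_transform(X)

# Project the new documents with the already-fitted models;
# they do not influence the vocabulary or the decomposition
X_new = vect.transform(new_data)
data2D_new = svd.transform(X_new)

# Original points as dots, new points as X markers
plt.scatter(data2D[:, 0], data2D[:, 1])
plt.scatter(data2D_new[:, 0], data2D_new[:, 1], marker="x")
plt.savefig("docs_2d.png")
```

Because the new documents only pass through transform, adding more of them never moves the original points.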
For performance reasons, it is usually better to apply TruncatedSVD to the sparse matrix than to densify the data and apply PCA. The results are not strictly identical, because PCA centres the data before the decomposition and TruncatedSVD does not, but for TF-IDF matrices the TruncatedSVD projection (this is latent semantic analysis) is the standard choice and looks very similar in practice.
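The relationship can be checked directly; this is my own illustration on random data (standing in for a densified TF-IDF matrix), not part of the answer above: PCA on a dense matrix matches TruncatedSVD applied to the same matrix after mean-centring, up to a possible sign flip per component:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(20, 10)  # stand-in for a small densified tf-idf matrix

# PCA centres the data internally before taking the SVD
pca_2d = PCA(n_components=2, svd_solver="full").fit_transform(X)

# TruncatedSVD does not centre; centring manually reproduces PCA
X_centered = X - X.mean(axis=0)
svd_2d = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(X_centered)

# The projections agree up to the sign of each component
print(np.allclose(np.abs(pca_2d), np.abs(svd_2d)))
```

On an uncentred sparse TF-IDF matrix the two therefore differ, but only by the centring step.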