I am looking to gain some insight into my data. I am converting my documents into a vector space model (VSM), reducing it with sklearn's PCA, and plotting the result with matplotlib. This involves:
Casting the documents to a numeric matrix using a pipeline
test = pipeline.fit_transform(docs).toarray()  # ndarray; recent scikit-learn rejects the np.matrix returned by .todense()
Fitting it to my model
pca = PCA().fit(test)
Then I am converting it using transform
data = pca.transform(test)
Finally I am plotting the results using Matplotlib
plt.scatter(data[:,0], data[:,1], c = categories)
My question is this: how do I take new sentences and determine where they would lie in relation to the documents already plotted, using an X to mark their positions?
Thanks
Also cast the new documents to a numeric array
new = pipeline.transform(new_docs).toarray()  # ndarray; recent scikit-learn rejects the np.matrix returned by .todense()
Note that this uses the pipeline with the previously fitted parameters, hence it's pipeline.transform, not pipeline.fit_transform.
Transform the new data using the previously fitted pca.
new_data = pca.transform(new)
This will transform the new data to the same PC-space as the original data.
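To see why the new points land in the same space, here is a minimal sketch on random stand-in data (the arrays X_train and X_new are made up for illustration, not taken from the question). With the default whiten=False, pca.transform subtracts the mean learned during fit and projects onto the components learned during fit, so nothing is re-estimated from the new data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))  # stand-in for the original document matrix
X_new = rng.normal(size=(3, 5))     # stand-in for the new documents

pca = PCA().fit(X_train)

# Manual projection: center with the mean learned from X_train, then
# project onto the components learned from X_train (whiten=False).
manual = (X_new - pca.mean_) @ pca.components_.T
projected = pca.transform(X_new)

same = np.allclose(manual, projected)
```

Calling fit or fit_transform on the new data instead would learn a different mean and different axes, and the two sets of points would no longer be comparable.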
Add the new data to the plot using a second scatter.
plt.scatter(data[:,0], data[:,1], c = categories)
plt.scatter(new_data[:,0], new_data[:,1], marker = 'x')
plt.show()
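Putting the steps together, a self-contained sketch might look like the following. The question does not show how pipeline is built, so the single TfidfVectorizer step here is an assumption, as are the toy docs, categories, and new_docs and the n_components=2 setting.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it to get a window
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the question's docs, categories and new_docs.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
]
categories = [0, 0, 1, 1]
new_docs = ["my cat chased the dog", "investors sold stocks"]

# Assumed pipeline: a single TF-IDF step (the real one is not shown).
pipeline = make_pipeline(TfidfVectorizer())

# Fit the pipeline and the PCA on the original documents only.
train = pipeline.fit_transform(docs).toarray()
pca = PCA(n_components=2).fit(train)
data = pca.transform(train)

# Reuse both fitted objects for the new documents: transform, never fit.
new = pipeline.transform(new_docs).toarray()
new_data = pca.transform(new)

fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], c=categories, label="original documents")
ax.scatter(new_data[:, 0], new_data[:, 1], marker="x", c="red", s=80,
           label="new documents")
ax.legend()
```

With an interactive backend, finish with plt.show(); otherwise fig.savefig(...) writes the plot to a file. Words in new_docs that the fitted vectorizer never saw are simply ignored, so new has the same number of columns as train.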