python matplotlib scikit-learn pca

Plotting new documents to scatter plot


I am looking to gain some insight into my data. I am converting my documents into a vector space model (VSM), reducing them with sklearn's PCA, and plotting the result on a matplotlib graph. This involves:

  1. Converting the documents to a numeric matrix using a pipeline

    test = pipeline.fit_transform(docs).toarray()
    
  2. Fitting a PCA model to it

    pca = PCA().fit(test)
    
  3. Transforming it with the fitted model

    data = pca.transform(test)
    
  4. Finally I am plotting the results using Matplotlib

    plt.scatter(data[:, 0], data[:, 1], c=categories)
    

My question is this: how do I take new sentences and determine where they would lie in relation to the documents already plotted, using an X to mark their positions?

Thanks


Solution

    1. Also cast the new documents to a numeric array

      new = pipeline.transform(new_docs).toarray()
      

      Note that this uses the pipeline with the previously fitted parameters, hence it's pipeline.transform, not pipeline.fit_transform.

    2. Transform the new data using the previously fitted pca.

      new_data = pca.transform(new)
      

      This projects the new data into the same principal-component space as the original data, because it reuses the mean and components learned from the original documents.

    3. Add the new data to the plot using a second scatter.

      plt.scatter(data[:, 0], data[:, 1], c=categories)
      plt.scatter(new_data[:, 0], new_data[:, 1], marker='x')
      plt.show()
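
The three steps can be put together as a minimal end-to-end sketch. The question does not show what `pipeline` contains, so a bare `TfidfVectorizer` is assumed here. The toy `docs`, `categories`, and `new_docs` are made up for illustration, and `plot_projection` is a hypothetical helper, not part of any library. The sparse output is converted with `.toarray()`, which yields a plain NumPy array.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels standing in for `docs` and `categories`.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "my dog chased the cat",
    "stocks fell as markets closed",
    "investors sold shares and bonds",
    "the market rallied on bank earnings",
]
categories = [0, 0, 0, 1, 1, 1]
new_docs = ["a cat and a dog", "shares and markets"]

# Assumed pipeline: the question does not say which steps it contains.
pipeline = make_pipeline(TfidfVectorizer())

# Fit the vectorizer and the PCA on the original documents only.
X = pipeline.fit_transform(docs).toarray()
pca = PCA(n_components=2).fit(X)
data = pca.transform(X)

# Reuse the fitted vectorizer and PCA for the new documents (no refitting).
X_new = pipeline.transform(new_docs).toarray()
new_data = pca.transform(X_new)


def plot_projection(data, categories, new_data):
    """Scatter the original points, then overlay the new ones as X markers."""
    fig, ax = plt.subplots()
    ax.scatter(data[:, 0], data[:, 1], c=categories, label="original")
    ax.scatter(new_data[:, 0], new_data[:, 1],
               marker="x", c="red", s=80, label="new")
    ax.legend()
    return ax
```

Calling `plot_projection(data, categories, new_data)` followed by `plt.show()` displays the figure. Words in `new_docs` that the fitted vectorizer has never seen are ignored, so a new sentence with no known vocabulary maps to an all-zero vector before the PCA projection.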