python, scikit-learn, pca

Adding a point to a PCA model


I am using PCA to reduce documents to 2 dimensions so I can visualise them. My method looks like this:

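  from sklearn.decomposition import PCA
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.pipeline import Pipeline

  pipeline = Pipeline([('tfidf', TfidfVectorizer())])
  X = pipeline.fit_transform(sent_list).toarray()  # PCA needs a dense array

  # fit_transform fits the model and projects X in one step
  pca = PCA(n_components=2)
  data2D = pca.fit_transform(X)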

Then I plot them with matplotlib: plt.scatter(data2D[:,0], data2D[:,1], c = label_colour)

I want to add 2 new points and see where they fall in the model. To date I have appended the new points to the end of the training data and plotted an X over the last two positions in the array, but I am not sure whether this is a true reflection of their position. Any insight would be great.


Solution

  • Both TfidfVectorizer and PCA retain the order of rows after the transformation, so what you are doing seems essentially correct (i.e. the last rows in the sent_list are mapped to the last rows in the data2D array).

    However, if the new data points should not affect the model, you should first fit the model with the original data and then transform the new data with the already fitted model. For example:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Fit the model with original data
    vect = TfidfVectorizer()
    X = vect.fit_transform(sent_list)

    svd = TruncatedSVD(n_components=2)
    data2D = svd.fit_transform(X)

    # Transform new data with fitted model
    X_new = vect.transform(new_data)
    data2D_new = svd.transform(X_new)
    

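    For performance reasons, it is probably better to use TruncatedSVD on the sparse matrix instead of densifying the data and applying PCA. The results will be similar but not identical: PCA centers each feature before the decomposition, whereas TruncatedSVD works on the uncentered data.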
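
    As a rough end-to-end sketch of the fit-then-transform approach, the snippet below uses a made-up toy corpus in place of sent_list and new_data (both are assumptions for illustration, not the asker's data). It fits the vectorizer and the decomposition on the original documents only, then projects the new documents with the fitted objects. PCA on the densified matrix is included purely for comparison.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for sent_list and new_data.
sent_list = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog and the cat are good pets",
    "stock prices fell sharply today",
    "investors sold shares as the market fell",
    "the market rallied after the rate cut",
]
new_data = ["my cat ignores the dog", "shares rose as the market recovered"]

# Fit on the original documents only.
vect = TfidfVectorizer()
X = vect.fit_transform(sent_list)      # sparse matrix, one row per document
svd = TruncatedSVD(n_components=2, random_state=0)
data2D = svd.fit_transform(X)

# Project the new documents with the already fitted vectorizer and SVD.
# Words that were not in the training vocabulary are ignored.
X_new = vect.transform(new_data)
data2D_new = svd.transform(X_new)

# For comparison: PCA needs a dense array and centers the features first,
# so its coordinates generally differ from the TruncatedSVD ones.
pca = PCA(n_components=2)
data2D_pca = pca.fit_transform(X.toarray())
data2D_new_pca = pca.transform(X_new.toarray())

print(data2D.shape, data2D_new.shape)
```

    To visualise the result, the original points can be drawn with plt.scatter(data2D[:, 0], data2D[:, 1], c=label_colour) and the new ones on top with plt.scatter(data2D_new[:, 0], data2D_new[:, 1], marker='x'). Because the new documents never reach fit, they cannot shift the axes that the original points define.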