machine-learning, pca

Can I exclude some columns from PCA?


I have data with five columns:

one | two | three | four | five

but I want this result:

pca 1 | pca 2 | five

Is it possible to apply PCA to only four of the columns and keep the fifth as it is?


Solution

  • There's nothing mathematically unsound about applying PCA to only a subset of your features. The PCA features are linear combinations (rotated axes) within that subspace, and the other (orthogonal) features are left unmodified.

    I've included an example of a multivariate Gaussian in x, y, z. I use PCA on x and y, leaving z unmodified. You can inspect the plots to convince yourself that the second set of points is indeed the same as the first, just rotated in the x-y plane:

    import numpy as np
    import plotly.express as px
    from sklearn.decomposition import PCA
    
    means = [0,0,0]
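    # a covariance matrix must be symmetric and positive semi-definite;
    # here x and y are strongly correlated and z is independent
    cov = [[1, 0.9, 0], [0.9, 1, 0], [0, 0, 1]]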
    
    # get scatter points drawn from multivariate
    x,y,z = np.random.multivariate_normal(means, cov, 5000).T
    
    # data
    X = np.array([x,y,z]).T
    
    # initial plot, with largest variance along x=y:
    px.scatter_3d(x=x, y=y, z=z, labels={j: j for j in "xyz"}).show()
    

    (Figure: original feature values)

    # fit pca in the x-y plane, leaving z un-modified
    pca = PCA(n_components=2)
    pca.fit(X[:, 0:2])
    
    # get "rotated" pca components x', y'
    q = pca.transform(X[:,0:2])
    xp, yp = q[:,0], q[:,1]
    
    px.scatter_3d(x=xp, y=yp, z=z, labels={"x":"x'", "y":"y'", "z":"z"}).show()
    

    (Figure: features after PCA)
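
Applied to the five-column layout from the question, the same idea could look roughly like the sketch below. It assumes the data lives in a pandas DataFrame with columns named one to five; the random data and the variable names (`df`, `pca_cols`, `result`) are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the real data: 200 rows, five numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=["one", "two", "three", "four", "five"],
)

# Columns that go through PCA; "five" is left out on purpose.
pca_cols = ["one", "two", "three", "four"]

# Reduce the four selected columns to two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(df[pca_cols])

# Rebuild a frame with the two components plus the untouched fifth column.
result = pd.DataFrame(components, columns=["pca 1", "pca 2"], index=df.index)
result["five"] = df["five"]

print(result.head())
```

If the step needs to sit inside a scikit-learn pipeline, `ColumnTransformer` with `remainder="passthrough"` is another way to get the same effect, though the output is then a plain array unless you name the columns yourself.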