python-3.x, k-means, pca, sklearn-pandas

After choosing k components in PCA, how do we find out which components (column names) the algorithm selected?


I am new to data science and need some help understanding PCA. I know that each column constitutes one axis, but when PCA is done and the components are reduced to some value k, how do I know which columns were selected?


Solution

  • In PCA you compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.
    Principal components are new variables constructed as linear combinations (mixtures) of the initial variables, so PCA does not pick a subset of the original columns: every component draws on all of them. The combinations are chosen so that the new variables (i.e., the principal components) are uncorrelated and most of the information in the initial variables is compressed into the first components. So 10-dimensional data gives you 10 principal components, but PCA puts the maximum possible information into the first component, the maximum remaining information into the second, and so on.

    Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture the most information in the data. There are as many principal components as there are variables in the data, and they are constructed so that the first principal component accounts for the largest possible variance in the data set.
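
As a rough illustration of the construction described above, here is a minimal NumPy sketch that computes the principal directions by hand from the covariance matrix. It assumes the same small toy array used later in this answer; the variable names (`X_centered`, `scores`, `explained_ratio`) are only illustrative, and this is not necessarily how scikit-learn computes PCA internally.

```python
import numpy as np

# Toy data: six samples, two columns (same values as the example later in this answer).
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# Centre each column, then compute the covariance matrix of the columns.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column of `eigenvectors` is one principal direction, i.e. a set of
# weights over ALL original columns. Projecting gives the new variables.
scores = X_centered @ eigenvectors

# Share of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```

The first entry of `explained_ratio` should agree with the `explained_variance_ratio_` value reported by scikit-learn for this data, and the columns of `scores` are uncorrelated with each other.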

    In my experience, if the cumulative sum of the eigenvalues, as a percentage of their total, exceeds 80% or 90%, the transformed vectors are enough to represent the original vectors.
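
One way to apply this rule of thumb is sketched below. The data is synthetic and generated only for illustration, and the 0.90 target is just the threshold mentioned above, not a universal recommendation. The sketch fits PCA with all components, takes the cumulative sum of `explained_variance_ratio_`, and picks the smallest k that reaches the target.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data, for illustration only: 200 rows, 5 columns,
# where the last two columns are noisy copies of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 2],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
])

# Fit with all components so the full variance spectrum is available.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the target.
target = 0.90
k = int(np.argmax(cumulative >= target)) + 1
print(cumulative, k)

# scikit-learn can make the same choice when n_components is a fraction
# between 0 and 1 (svd_solver="full" is passed explicitly here).
pca_auto = PCA(n_components=target, svd_solver="full").fit(X)
print(pca_auto.n_components_)
```

Both approaches should land on the same number of components; the manual version is mainly useful when you want to inspect or plot the cumulative curve before deciding.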

    To explain clearly, let's use @Nicholas M's code.

    import numpy as np
    from sklearn.decomposition import PCA
    # Six samples with two features (columns) each
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    # Keep only the first principal component
    pca = PCA(n_components=1)
    pca.fit(X)
    

    Increase n_components until the explained variance reaches at least 90%.

    Input:

    pca.explained_variance_ratio_
    

    Output:

    array([0.99244289])
    

    In this example, just 1 component is enough.
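
To connect this back to the original question about column names: since each component is a weighted mix of all columns, what you can inspect is how strongly each column contributes. A minimal sketch follows, using the fitted model's `components_` attribute; the names `col_a` and `col_b` are hypothetical, since the toy array has no column names.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
feature_names = ["col_a", "col_b"]  # hypothetical names for the two columns

pca = PCA(n_components=1).fit(X)

# components_ has shape (n_components, n_features): one row per component,
# one weight per original column.
for i, weights in enumerate(pca.components_):
    loadings = {name: round(float(w), 3) for name, w in zip(feature_names, weights)}
    print(f"PC{i + 1}: {loadings}")

# Columns ranked by the size of their weight on the first component.
ranking = [feature_names[j] for j in np.argsort(-np.abs(pca.components_[0]))]
print(ranking)
```

A large absolute weight means that column has a strong influence on the component, but no column is "selected" or dropped. If you need an actual subset of the original columns, that is a feature selection task rather than PCA.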

    I hope that makes it clear.

    Resources:
    https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
    https://towardsdatascience.com/a-step-by-step-explanation-of-principal-component-analysis-b836fb9c97e2