My data set consists of N elements and K variables. Using PCA I can reduce the number of variables, but how can I check which of the K variables provide the most information?
For example, I have a data set like this:
1 1 1 2
2 2 1 4
3 3 2 11
1 1 2 7
2 2 3 14
3 3 3 16
1 1 4 17
2 2 4 19
3 3 3 16
I know that the 1st column is identical to the 2nd, and the 4th column is determined by the relation 2*(1st) + 5*(3rd) - 5. Therefore the 1st and 3rd columns provide the most information, and the remaining columns add nothing. But how can I compute this using PCA?
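(For reference, the relations I describe above can be checked numerically; `X` below is the example data set.)

```python
import numpy as np

# The example data set: rows are the N = 9 elements, columns the K = 4 variables.
X = np.array([
    [1, 1, 1,  2],
    [2, 2, 1,  4],
    [3, 3, 2, 11],
    [1, 1, 2,  7],
    [2, 2, 3, 14],
    [3, 3, 3, 16],
    [1, 1, 4, 17],
    [2, 2, 4, 19],
    [3, 3, 3, 16],
], dtype=float)

# Column 2 duplicates column 1, and column 4 = 2*col1 + 5*col3 - 5.
print(np.array_equal(X[:, 1], X[:, 0]))                     # True
print(np.allclose(X[:, 3], 2 * X[:, 0] + 5 * X[:, 2] - 5))  # True
```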
For your example:
PC1 PC2 PC3 PC4
[1,] 0.3516359 -0.79142416 2.497231e-17 -1.299998e-16
[2,] 0.3516359 -0.79142416 1.713028e-16 1.168541e-16
[3,] 1.0831644 0.32331520 4.906878e-16 -3.286408e-17
[4,] 6.1190936 0.03372767 -9.813756e-17 6.572817e-18
Principal components are the sqrt(eigenvalues) x eigenvectors of the covariance matrix (the eigenvalues are guaranteed real and non-negative, and the eigenvectors orthogonal, since the covariance matrix is symmetric positive semi-definite).
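A minimal sketch of that computation in Python/NumPy (the signs of the columns may differ from the table above, since eigenvectors are only determined up to sign):

```python
import numpy as np

# The example data set from the question.
X = np.array([
    [1, 1, 1,  2],
    [2, 2, 1,  4],
    [3, 3, 2, 11],
    [1, 1, 2,  7],
    [2, 2, 3, 14],
    [3, 3, 3, 16],
    [1, 1, 4, 17],
    [2, 2, 4, 19],
    [3, 3, 3, 16],
], dtype=float)

C = np.cov(X, rowvar=False)       # 4x4 covariance matrix of the variables
evals, evecs = np.linalg.eigh(C)  # real eigenvalues, orthonormal eigenvectors
order = np.argsort(evals)[::-1]   # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

# Loadings: eigenvectors scaled by sqrt(eigenvalues).
# Clip tiny negative eigenvalues caused by rounding before taking the root.
loadings = evecs * np.sqrt(np.clip(evals, 0, None))

print(loadings)                               # columns ~ PC1..PC4; last two ~ 0
print(np.allclose(loadings @ loadings.T, C))  # loadings reproduce C: True
```

The final check works because if C = V L V^T, then (V sqrt(L)) (V sqrt(L))^T = C again.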
You can see in your example that two components are sufficient to explain all the variables within rounding tolerance (i.e. the 'rank' of the covariance matrix is 2), since PC3 and PC4 are almost zero.
This effectively rotates the data into a vector space whose axes align with hidden variables. To talk about the columns of your original problem you need to map back, e.g. by looking at the largest loading in absolute value. Here we would say PC1 is mainly linked to column 4, but this choice is somewhat arbitrary: PC2 has equal weight on columns 1 and 2, so either looks just as good. Remember that correlation does not imply causation.
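One way to do that mapping back (a sketch, assuming the simple "largest absolute loading" rule described above, which as noted is ambiguous when two columns tie):

```python
import numpy as np

# The example data set from the question.
X = np.array([
    [1, 1, 1,  2],
    [2, 2, 1,  4],
    [3, 3, 2, 11],
    [1, 1, 2,  7],
    [2, 2, 3, 14],
    [3, 3, 3, 16],
    [1, 1, 4, 17],
    [2, 2, 4, 19],
    [3, 3, 3, 16],
], dtype=float)

C = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
loadings = evecs * np.sqrt(np.clip(evals, 0, None))

# Keep only PCs whose eigenvalue is non-negligible (here: the first two).
rank = int(np.sum(evals > 1e-10 * evals.max()))
for j in range(rank):
    col = int(np.argmax(np.abs(loadings[:, j])))
    print(f"PC{j + 1} loads most heavily on column {col + 1}")
```

For PC2 the loadings on columns 1 and 2 are equal up to rounding, so `argmax` simply returns whichever comes first; that tie is exactly the ambiguity mentioned above.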