I am new to Python programming and would like to ask about PCA (Principal Component Analysis) in NumPy. I have a dataset stored as a 2-D NumPy array. How can I perform PCA on this dataset using NumPy? What would be the best method?
A sample of the array:
[[  9.59440303 -30.33995167  -9.56393401 ...,  20.47675724  21.32716639   4.72543396]
 [  9.51383834 -29.91598995 -15.53265741 ...,  29.3551776   22.27276737   0.21362916]
 [  9.51410643 -29.76027936 -14.61218821 ...,  26.02439054   4.7944802   -4.97069797]
 ...,
 [ 10.18460025 -25.08264383  -8.48524125 ...,  -3.86304594  -7.48117144   0.49041786]
 [ 10.11421507 -27.23984612  -8.57355611 ...,   1.86266657  -5.25912341   4.07026804]
 [ 11.86344836 -29.08311293  -6.40004177 ...,   3.81287345  -8.21500311  18.31793505]]
The data above is just an example; the actual dataset is much larger and its features may be correlated. Feel free to use the Iris dataset or any other dummy data.
As Nils suggested, the easiest solution is to use the PCA class from the scikit-learn package. If for some reason you can't use scikit-learn, the PCA algorithm itself is fairly simple. You can find scikit-learn's implementation here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L408
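For the scikit-learn route, here is a minimal sketch. The random data and the choice of 2 components are just placeholders for your own array:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for your 2-D data array: 100 samples, 5 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
coefficients = pca.fit_transform(data)  # data projected onto the top 2 components, shape (100, 2)
components = pca.components_            # principal axes, one per row, shape (2, 5)
print(coefficients.shape, components.shape)
```

`pca.explained_variance_ratio_` then tells you how much of the total variance each component captures, which helps you choose `n_components`.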
As a simplified summary of that implementation:
# Center each feature (column); np.mean(data) without axis=0 would
# subtract the mean of the whole array instead.
centered_data = data - np.mean(data, axis=0)
U, S, V = np.linalg.svd(centered_data, full_matrices=False)
components = V                        # rows of V are the principal axes
coefficients = np.dot(U, np.diag(S))  # data projected onto the components
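Putting the summary above into a self-contained, runnable sketch (the function name `pca` and the random test data are my own, not from scikit-learn):

```python
import numpy as np

def pca(data, n_components=None):
    """PCA via SVD; returns (coefficients, components, explained_variance)."""
    # Center each feature (column) so the SVD captures variance, not the mean.
    centered = data - data.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt       # rows are the principal axes
    coefficients = U * S  # same as np.dot(U, np.diag(S)), but cheaper
    explained_variance = (S ** 2) / (data.shape[0] - 1)
    if n_components is not None:
        components = components[:n_components]
        coefficients = coefficients[:, :n_components]
        explained_variance = explained_variance[:n_components]
    return coefficients, components, explained_variance

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))  # stand-in for your 2-D array
coeffs, comps, var = pca(data, n_components=2)

# Sanity check: projecting the centered data onto the components
# reproduces the coefficients exactly.
check = (data - data.mean(axis=0)) @ comps.T
print(np.allclose(coeffs, check))
```

Note that the SVD route avoids forming the covariance matrix explicitly, which is both faster and more numerically stable for wide datasets like yours.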