Tags: python, scikit-learn, pca

How to interpret explained variance ratio plot from principal components of PCA with sklearn


I am trying to use PCA to reduce the dimensionality of my data before applying K-means clustering.

In the dataset below, I have points, assists and rebounds columns. According to the plot, the first three principal components contain the highest percentage of the variance.

Is there a way to tell what each of the first three components corresponds to? For example, does one of them correspond to the column "points" in the year 2021? Or, what is the correct way to interpret this plot?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df_full = pd.DataFrame({
    'year': [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
             2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
    'store': ['store1', 'store2', 'store3', 'store4', 'store5',
              'store6', 'store7', 'store8', 'store9', 'store10',
              'store1', 'store2', 'store3', 'store4', 'store5',
              'store6', 'store7', 'store8', 'store9', 'store10'],
    'points': [18, 33, 19, 14, 14, 11, 20, 28, 30, 31,
               35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
    'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9,
                12, 14, 5, 9, 4, 3, 4, 12, 15, 11],
    'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                 11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})

# create pivot table for clustering analysis
df = df_full.pivot(index=['store'],columns=['year']).reset_index()

# set index for clustering analysis
df.set_index(['store'], inplace=True)

# standardize the data
scaled_df = StandardScaler().fit_transform(df)

# plot the explained variance ratio to decide how many components to keep
pca = PCA(n_components=6)
pca.fit(scaled_df)

var = pca.explained_variance_ratio_
feature = range(pca.n_components_)
plt.bar(feature, var)
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(feature)
plt.show()

# use the chosen number of components, which is 3 in this case
pca = PCA(n_components=3)
pca.fit(scaled_df)

df_transform = pca.transform(scaled_df)

# apply k-means clustering to the PCA-transformed data
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1)
clusters = kmeans.fit_predict(df_transform)

[Bar chart: explained variance ratio for each PCA feature]

pca.components_
array([[ 0.35130535, -0.50070859,  0.29700875,  0.26964774, -0.59032958,
        -0.34126579],
       [ 0.56248993,  0.3654443 , -0.30040924,  0.65744874,  0.09535234,
         0.13593718],
       [ 0.18181155,  0.05593549,  0.69079082, -0.0149547 , -0.00170045,
         0.69742189]])

Solution

  • Wikipedia summarizes the definition of PCA pretty well, in my opinion:

    PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

    As you can see from this definition, the principal components are just vectors in your original feature space. As an example, let's say you have measured the mental state of some people in two dimensions: "happiness" (dimension 1) and "boredom" (dimension 2). Now you do PCA and get a vector (0.6, 0.4) as your first principal component. You can interpret this as your sample of people being best described by a mental state that combines "happiness" with 60% relevance and "boredom" with 40% relevance, if you only want one dimension to describe them.
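
    As a minimal sketch of that idea (the data, variable names and numbers here are made up purely for illustration), you can fit PCA on a two-column array and read the weights of the first component directly:

import numpy as np
from sklearn.decomposition import PCA

# toy 2D data: column 0 = "happiness", column 1 = "boredom" (correlated on purpose)
rng = np.random.default_rng(42)
happiness = rng.normal(loc=5, scale=2, size=100)
boredom = 0.7 * happiness + rng.normal(scale=1, size=100)
X = np.column_stack([happiness, boredom])

pca_2d = PCA(n_components=1).fit(X)

# one row per component, one column per original feature;
# the two entries are the weights of "happiness" and "boredom"
# in the direction that captures the most variance
print(pca_2d.components_[0])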

    In sklearn, you can get the principal components via pca.components_.
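
    For your data, each row of pca.components_ is one component and each column lines up with one of the six pivoted feature columns (points/assists/rebounds per year). Here is a small sketch, reusing df and the 3-component pca from your code (the loadings variable name is just for illustration), that labels the weights and shows which original column dominates each component:

import pandas as pd

# label each component's weights with the pivoted column names, e.g. ('points', 2021)
loadings = pd.DataFrame(
    pca.components_,
    columns=df.columns,
    index=[f'PC{i + 1}' for i in range(pca.n_components_)],
)
print(loadings)

# for each component, the original column with the largest absolute weight
print(loadings.abs().idxmax(axis=1))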

    Mathematically there are different interpretations and derivations. From a statistical point of view, the principal components are the eigenvectors of the covariance matrix of your random variables (feature vectors). In linear algebra, you describe it via singular value decomposition (SVD), which is also the common method for computing PCA.
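
    To make that connection concrete, here is a small sketch (random toy data, just for illustration) showing that sklearn's components match the right singular vectors of the centered data, up to a sign flip per component:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))        # toy data: 20 samples, 6 features
Xc = X - X.mean(axis=0)             # PCA works on centered data

pca_check = PCA(n_components=3).fit(X)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# rows of Vt are the principal components (up to sign)
print(np.allclose(np.abs(pca_check.components_), np.abs(Vt[:3])))

# the explained variances are the squared singular values divided by (n_samples - 1)
print(np.allclose(pca_check.explained_variance_, S[:3] ** 2 / (X.shape[0] - 1)))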