Tags: python, r, pandas, scikit-learn, pca

Cumulative Explained Variance for PCA in Python


I have a simple R script that runs FactoMineR's PCA on a tiny dataframe in order to find the cumulative percentage of variance explained by each principal component:

library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)

df <- data.frame(a, b, c, d)

df_pca <- PCA(df, ncp = 4, graph = FALSE)
# $eig is a matrix, so the column is extracted with [, ] rather than $
print(unname(df_pca$eig[, "cumulative percentage of variance"]))

Which returns:

> print(unname(df_pca$eig[, "cumulative percentage of variance"]))
[1]  58.55305  84.44577  99.86661 100.00000

I'm trying to do the same in Python using scikit-learn's decomposition package as follows:

import pandas as pd
from sklearn import decomposition

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]

df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})

pca = decomposition.PCA(n_components = 4)
pca.fit(df)
transformed_pca = pca.transform(df)

# accumulate the explained-variance ratio of each component
cum_explained_var = []
for i in range(0, len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] + 
                                 cum_explained_var[i-1])
print(cum_explained_var)

But this results in:

[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]

As you can see, both correctly add up to 100%, but the per-component contributions differ between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R results in Python?

EDIT: Thanks to Vlo, I now know that the differences stem from the FactoMineR PCA function scaling the data by default. By using the sklearn preprocessing package (pca_data = preprocessing.scale(df)) to scale my data before running PCA, my results match the R output.
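
For anyone wondering why scaling changes the proportions: standardizing each column means the PCA effectively diagonalizes the correlation matrix instead of the covariance matrix, so the variance is redistributed across components while still summing to 100%. A quick way to check that equivalence (my own sketch, reconstructing the same dataframe as above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [4, 2, 9, 23, 3],
                   'c': [9, 8, 7, 6, 6],
                   'd': [45, 36, 74, 35, 29]})

# eigenvalues of the correlation matrix, sorted largest-first
# (np.corrcoef expects variables as rows, hence the transpose)
eigvals = np.linalg.eigvalsh(np.corrcoef(df.T))[::-1]

# cumulative percentage of variance; this should reproduce
# the FactoMineR output above
print(np.cumsum(eigvals) / eigvals.sum() * 100)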


Solution

  • Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that FactoMineR scales the data by default. By simply adding a scaling step to my Python code, I was able to reproduce the results.

    import pandas as pd
    from sklearn import decomposition, preprocessing
    
    a = [1, 2, 3, 4, 5]
    b = [4, 2, 9, 23, 3]
    c = [9, 8, 7, 6, 6]
    d = [45, 36, 74, 35, 29]
    
    df = pd.DataFrame({'a': a,
                       'b': b,
                       'c': c,
                       'd': d})

    # standardize each column to zero mean and unit variance,
    # mirroring FactoMineR's default scaling
    pca_data = preprocessing.scale(df)
    
    pca = decomposition.PCA(n_components = 4)
    pca.fit(pca_data)
    transformed_pca = pca.transform(pca_data)
    
    # accumulate the explained-variance ratio of each component
    cum_explained_var = []
    for i in range(0, len(pca.explained_variance_ratio_)):
        if i == 0:
            cum_explained_var.append(pca.explained_variance_ratio_[i])
        else:
            cum_explained_var.append(pca.explained_variance_ratio_[i] + 
                                     cum_explained_var[i-1])
    
    print(cum_explained_var)
    

    Output:

    [0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
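
    As a side note, the manual accumulation loop can be replaced with numpy's cumsum. A minimal sketch, assuming the fitted pca object from the code above:

    import numpy as np

    # running total of the per-component explained-variance ratios;
    # equivalent to the loop above
    print(np.cumsum(pca.explained_variance_ratio_))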