
PCA explained variance is the same on permutations of data


I am following this tutorial, which compares the explained variance of the top 50 PCs of a dataset with that of the top 50 PCs of several permutations of the same dataset. It appears they permute only the columns.

https://towardsdatascience.com/how-to-tune-hyperparameters-of-tsne-7c0596a18868

I tried to replicate this in Python, but I'm getting exactly the same explained variance for every permutation. Can someone help me understand why the explained variance of my permuted data is always identical?

def exp_var_perm_data(data, n_permutations=1):
    """
        data: Assumed to be a pandas dataframe, object that has a .shape attribute
        n_permutations: Integer. Number of permutations to perform
    """    
    df = pd.DataFrame(columns=["Dim%d" % i for i in range(0, data.shape[1])])
    for k in range(0,n_permutations):
        pca_permuted = PCA()
        data_permuted = data.sample(frac=1).reset_index(drop=True)
        pca_permuted.fit(data_permuted)
        df.loc[k] = pca_permuted.explained_variance_ratio_
    return df

from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd

iris_data = datasets.load_iris()
iris_data = iris_data.data

exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
print(exp_var_perm)

Output:

       Dim0      Dim1      Dim2      Dim3
0  0.879444  0.093535  0.021659  0.005363
1  0.879444  0.093535  0.021659  0.005363
2  0.879444  0.093535  0.021659  0.005363
3  0.879444  0.093535  0.021659  0.005363
4  0.879444  0.093535  0.021659  0.005363
5  0.879444  0.093535  0.021659  0.005363
6  0.879444  0.093535  0.021659  0.005363
7  0.879444  0.093535  0.021659  0.005363
8  0.879444  0.093535  0.021659  0.005363
9  0.879444  0.093535  0.021659  0.005363

Solution

  • The tutorial permutes each column independently, as far as I can tell from the R code:

    expr_perm <- apply(expr,2,sample)
    

    This seems reasonable, as the goal is to generate data under the null hypothesis of zero covariance.

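    However, the corresponding code in the question shuffles whole rows (all columns together). That only reorders the samples, so the covariance matrix, and with it the explained variance, stays exactly the same: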

    data_permuted = data.sample(frac=1).reset_index(drop=True)
    

    Similar to the R code, we can use apply with axis=0 to permute each column independently (using a small helper function to do the permutation):

    data_permuted = data.apply(permute, axis=0, raw=True)
    

    Here is the fully working example:

    from sklearn.decomposition import PCA
    from sklearn import datasets
    import pandas as pd
    import numpy as np
    
    
    def exp_var_perm_data(data, n_permutations=1):
        """
            data: Assumed to be a pandas dataframe, object that has a .shape attribute
            n_permutations: Integer. Number of permutations to perform
        """    
        df = pd.DataFrame(columns=["Dim%d" % i for i in range(0, data.shape[1])])
        for k in range(n_permutations):
            pca_permuted = PCA()
            # axis=0 hands each column to permute, so every column is shuffled independently
            data_permuted = data.apply(permute, axis=0, raw=True)
            pca_permuted.fit(data_permuted)
            df.loc[k] = pca_permuted.explained_variance_ratio_
        return df
    
    
    def permute(x):
        """Create a randomly permuted copy of x"""
        x = x.copy()
        np.random.shuffle(x)
        return x
    
    
    iris_data = datasets.load_iris()
    iris_data = iris_data.data
    
    exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
    print(exp_var_perm)
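
To see why the two kinds of shuffling behave differently, here is a minimal sketch that uses only NumPy. It is not the tutorial's code: it assumes synthetic correlated data instead of iris, and the helper name explained_variance_ratio is made up for illustration. The helper returns the normalized eigenvalues of the covariance matrix, which should correspond to what PCA reports as explained_variance_ratio_. Note that Generator.permuted needs NumPy 1.20 or later.

```python
import numpy as np


def explained_variance_ratio(X):
    """Eigenvalues of the covariance matrix, largest first, scaled to sum to 1.

    Hypothetical helper for illustration only.
    """
    cov = np.cov(X, rowvar=False)            # features are in columns
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()


rng = np.random.default_rng(0)

# Synthetic, strongly correlated data (an assumed stand-in for a real dataset):
# four features driven by one latent variable plus a little noise.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 2.0, -1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 4))

original = explained_variance_ratio(X)

# Shuffling whole rows only reorders the samples; the covariance is unchanged.
rows_shuffled = explained_variance_ratio(rng.permutation(X, axis=0))

# Shuffling each column independently breaks the correlation between features.
cols_shuffled = explained_variance_ratio(rng.permuted(X, axis=0))

print("original:     ", original)
print("rows shuffled:", rows_shuffled)
print("cols shuffled:", cols_shuffled)
```

The first two printed vectors should agree up to floating-point error, while the third should be noticeably flatter, because the first component no longer captures a shared direction of variation.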