I am following this tutorial, which compares the explained variance in the top 50 PCs of a dataset to the top 50 PCs of several permutations of that same dataset. As far as I can tell, they only permute the columns.
https://towardsdatascience.com/how-to-tune-hyperparameters-of-tsne-7c0596a18868
I tried to replicate this in Python, but I'm getting exactly the same explained variance for every permutation. Can someone help me understand why?
def exp_var_perm_data(data, n_permutations=1):
    """
    data: Assumed to be a pandas dataframe, object that has a .shape attribute
    n_permutations: Integer. Number of permutations to perform
    """
    df = pd.DataFrame(columns=["Dim%d" % i for i in range(data.shape[1])])
    for k in range(n_permutations):
        pca_permuted = PCA()
        # Shuffle the rows; drop=True keeps the old index out of the data
        data_permuted = data.sample(frac=1).reset_index(drop=True)
        pca_permuted.fit(data_permuted)
        df.loc[k] = pca_permuted.explained_variance_ratio_
    return df
from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd

iris_data = datasets.load_iris().data
exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
print(exp_var_perm)
Output:
Dim0 Dim1 Dim2 Dim3
0 0.879444 0.093535 0.021659 0.005363
1 0.879444 0.093535 0.021659 0.005363
2 0.879444 0.093535 0.021659 0.005363
3 0.879444 0.093535 0.021659 0.005363
4 0.879444 0.093535 0.021659 0.005363
5 0.879444 0.093535 0.021659 0.005363
6 0.879444 0.093535 0.021659 0.005363
7 0.879444 0.093535 0.021659 0.005363
8 0.879444 0.093535 0.021659 0.005363
9 0.879444 0.093535 0.021659 0.005363
The tutorial permutes each column independently, as far as I can tell from the R code:
expr_perm <- apply(expr,2,sample)
This seems reasonable, as the goal is to generate data under the null hypothesis of zero covariance.
However, the corresponding code in the question shuffles whole rows: data.sample(frac=1) changes the order of the observations but keeps each row intact, so the covariance matrix, and therefore the PCA result, is exactly the same every time:
data_permuted = data.sample(frac=1).reset_index(drop=True)
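A quick check makes this concrete (a sketch on the iris data; the seeds are arbitrary choices of mine):

```python
import numpy as np
import pandas as pd
from sklearn import datasets

X = pd.DataFrame(datasets.load_iris().data)
rng = np.random.default_rng(0)

# Shuffling whole rows only reorders the observations:
# the covariance matrix, and hence the PCA, is unchanged.
rows_shuffled = X.sample(frac=1, random_state=0).reset_index(drop=True)
print(np.allclose(X.cov(), rows_shuffled.cov()))   # True

# Permuting each column independently destroys the covariance
# between columns, which is the null model the tutorial targets.
cols_shuffled = X.apply(lambda col: rng.permutation(col), raw=True)
print(np.allclose(X.cov(), cols_shuffled.cov()))   # almost surely False
```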
Similar to the R code, we can use DataFrame.apply to permute each column (using a small helper function to do the permutation). Note that axis=0, the default, passes each column to the function; axis=1 would pass each row instead:
data_permuted = data.apply(permute, axis=0, raw=True)
Here is the fully working example:
from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd
import numpy as np
def permute(x):
    """Return a randomly permuted copy of x."""
    x = x.copy()
    np.random.shuffle(x)
    return x

def exp_var_perm_data(data, n_permutations=1):
    """
    data: Assumed to be a pandas dataframe, object that has a .shape attribute
    n_permutations: Integer. Number of permutations to perform
    """
    df = pd.DataFrame(columns=["Dim%d" % i for i in range(data.shape[1])])
    for k in range(n_permutations):
        pca_permuted = PCA()
        # axis=0 (the default) applies permute to each column independently
        data_permuted = data.apply(permute, axis=0, raw=True)
        pca_permuted.fit(data_permuted)
        df.loc[k] = pca_permuted.explained_variance_ratio_
    return df
iris_data = datasets.load_iris()
iris_data = iris_data.data
exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
print(exp_var_perm)
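For larger datasets, the pandas apply can be avoided entirely. This is a sketch of the same permutation scheme using numpy.random.Generator.permuted (NumPy >= 1.20), which shuffles within each column independently when given axis=0; the function name and seed here are my own choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn import datasets

def exp_var_perm_numpy(X, n_permutations=1, seed=None):
    """Explained variance ratios of PCA on column-wise permutations of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    out = np.empty((n_permutations, X.shape[1]))
    for k in range(n_permutations):
        # axis=0: shuffle within each column, independently per column
        X_perm = rng.permuted(X, axis=0)
        out[k] = PCA().fit(X_perm).explained_variance_ratio_
    return out

result = exp_var_perm_numpy(datasets.load_iris().data, n_permutations=10, seed=0)
print(result.shape)        # (10, 4)
print(result.sum(axis=1))  # each row sums to 1
```

Because a fixed seed is passed to the generator, the permutations (and thus the table of explained variances) are reproducible from run to run.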