I am studying the latent space of a generative model, where each latent sample has a shape of (64, 64, 3). I would like to visualize a subset of this data, say n=5, in a 2D plot. To achieve this, I have reshaped the data to shape (5, 12288) and used PCA to reduce it to the first 2 principal components, which I then plot using matplotlib.
However, I am uncertain about the amount of variance captured by the PCA. When I check, it shows that more than 99% of the variance is captured. I think this might be due to the small number of samples I used, such that there can be at most 5 non-zero singular values in this case. Is my understanding correct? Does this mean that the variance captured by the PCA is not meaningful for the full latent space?
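For what it's worth, a quick sketch on random stand-in data (the same substitution I use in my code below, not my real latent vectors) seems to support this: with n = 5 samples the explained-variance ratios are spread over at most 5 components and always sum to 1, no matter how large the 12288-dimensional ambient space is:

import numpy as np
from sklearn.decomposition import PCA

n = 5
# random stand-in for the real latent samples, already flattened to (n, 12288)
data = np.random.rand(n, 64 * 64 * 3)

# keep as many components as sklearn allows, i.e. min(n_samples, n_features) = 5
pca_full = PCA().fit(data)
print(pca_full.explained_variance_ratio_)        # only n - 1 = 4 ratios are meaningfully non-zero after mean-centering
print(pca_full.explained_variance_ratio_.sum())  # ~1.0 regardless of the 12288 ambient dimensions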
Here is the code I used to reshape my data, reduce it with PCA, and check the captured variance:
import numpy as np
from sklearn.decomposition import PCA

def matrix_to_point(A):
    # Convert a matrix to a point by flattening it into a 1D vector
    return A.reshape(-1)

n = 5
latent_sample = np.random.rand(n, *(64, 64, 3))
data = np.asarray([matrix_to_point(m) for m in latent_sample])

pca = PCA(n_components=2)
pca = pca.fit(data)
reduced_data = pca.transform(data)
print(f'Variance captured by the PCA: {pca.explained_variance_ratio_}')

# output with the posted code:   Variance captured by the PCA: [0.25629761 0.25391076]
# output with the complete code: Variance captured by the PCA: [0.96852827 0.03129395]
In this code, I substituted the actual latent samples with random samples to make it executable. Thank you in advance for your assistance.
I would judge the quality of the PCA by inverse-transforming the reduced data and measuring the reconstruction error. Here I used RMSE, but you can use another metric if that suits your use case better:
import numpy as np
from sklearn.decomposition import PCA

def matrix_to_point(A):
    # Convert a matrix to a point by flattening it into a 1D vector
    return A.reshape(-1)

n = 5
latent_sample = np.random.rand(n, *(64, 64, 3))
data = np.asarray([matrix_to_point(m) for m in latent_sample])

pca = PCA(n_components=2)
pca = pca.fit(data)
reduced_data = pca.transform(data)
print(f'Variance captured by the PCA: {pca.explained_variance_ratio_}')

# Reconstruct the data from the 2D representation and compare it to the original
expanded_data = pca.inverse_transform(reduced_data)
rmse = np.sqrt(np.mean((expanded_data - data) ** 2))
print(f'Root mean square error: {rmse}')
If your data actually lies in a two-dimensional subspace of the full space, the fit will be very good and the RMSE will be very small.
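As a quick illustrative sketch (with synthetic data standing in for your latent vectors): if the samples are constructed as combinations of just two basis vectors, a 2-component PCA reconstructs them essentially exactly and the RMSE comes out near zero:

import numpy as np
from sklearn.decomposition import PCA

n, d = 5, 64 * 64 * 3
rng = np.random.default_rng(0)

# synthetic data confined to a 2-dimensional subspace of the 12288-dimensional space
basis = rng.random((2, d))
coeffs = rng.random((n, 2))
data = coeffs @ basis

pca = PCA(n_components=2).fit(data)
reduced_data = pca.transform(data)
expanded_data = pca.inverse_transform(reduced_data)

rmse = np.sqrt(np.mean((expanded_data - data) ** 2))
print(f'Variance captured by the PCA: {pca.explained_variance_ratio_}')  # sums to ~1.0
print(f'Root mean square error: {rmse}')  # ~0 because the data really is two-dimensional

With data that is not confined to two dimensions (like the uniform random samples above), the RMSE will be noticeably larger, which makes it a more informative quality check than the explained-variance ratio alone.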