python numpy dataframe pca

Fabricate a dataset to test PCA in sklearn?


I would like to test my PCA workflow. To do so, I want to create a dataset with, let's say, 3 features and a known relationship between them, then apply PCA and check whether those relationships were captured. What is the most straightforward way to do this in Python?

Thank you!


Solution

  • You can create samples where two features are independent of each other and a third feature is a linear combination of the other two.

    For example:

    import numpy as np
    from numpy.random import random
    
    N_SAMPLES = 1000
    
    samples = random((N_SAMPLES, 3))
    
    # Let us suppose that the column `1` will have the dependent feature, the other two being independent
    
    samples[:, 1] = 3 * samples[:, 0] - 2 * samples[:, 2]
    

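    Now, if you run PCA with two principal components on these samples, the two components should account for all of the variance, so the entries of `explained_variance_ratio_` should sum to 1 (up to floating-point rounding).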

    For example:

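    import numpy as np
    from sklearn.decomposition import PCA
    
    pca2 = PCA(n_components=2)
    pca2.fit(samples)
    
    # Compare with a tolerance: an exact `== 1.0` check can fail due to floating-point rounding
    assert np.isclose(pca2.explained_variance_ratio_.sum(), 1.0)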
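
To check that PCA actually recovered the relationship itself, and not just that two components suffice, one option is to keep all three components and look at the last one. This is a minimal sketch, not the only way to do it: the fixed seed, the variable names, and the choice of `svd_solver="full"` are assumptions made for illustration. Because every sample satisfies `3*x0 - x1 - 2*x2 = 0`, the direction with (numerically) zero variance should be parallel to the normal vector `(3, -1, -2)`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility
n_samples = 1000

# Same construction as above: columns 0 and 2 are independent,
# column 1 is the linear combination 3 * x0 - 2 * x2.
samples = rng.random((n_samples, 3))
samples[:, 1] = 3 * samples[:, 0] - 2 * samples[:, 2]

# Keep all three components so the zero-variance direction is visible.
pca3 = PCA(n_components=3, svd_solver="full")
pca3.fit(samples)

# The data satisfy 3*x0 - x1 - 2*x2 = 0, so the direction with no variance
# should be parallel to the plane's normal vector (3, -1, -2).
normal = np.array([3.0, -1.0, -2.0])
normal /= np.linalg.norm(normal)

last_component = pca3.components_[2]
# Absolute value because the sign of a principal component is arbitrary;
# a value of 1.0 means the two unit vectors are parallel.
alignment = abs(last_component @ normal)

print("explained variance ratios:", pca3.explained_variance_ratio_)
print("last component:", last_component)
print("alignment with (3, -1, -2):", alignment)
```

If the workflow is correct, the third explained variance ratio should be negligible and the alignment should be very close to 1. Changing the coefficients in the linear combination should change the recovered normal vector accordingly.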