python, scikit-learn, pca

scikit-learn PCA computes an incorrect y-value for the last row


I'm performing a PCA using scikit-learn in Python 3.

However, after I run my code, the principal components computed for the last row are "off". I know for a fact that the data in the last row is correct.

I made three PCA plots to visualize the problem. In the first plot (the full dataset), the "Sample" point is plotted where expected. In the second and third plots, where I removed some populations (so only part of the full dataset is used), the "Sample" point ends up in a "weird" position.

Plot 1

Plot 2

Plot 3

The dataframe with computed principal components (see last row):

      principal_component_1  principal_component_2 Sample_name         Population
0                  3.279363              -0.288892     HG02291  American_Ancestry
1                  3.625035              -0.296081     HG02275  American_Ancestry
2                  3.870248              -0.264558     HG02272  American_Ancestry
3                  3.118460              -0.272594     HG02271  American_Ancestry
4                  2.811992              -0.376418     HG02259  American_Ancestry
...                     ...                    ...         ...                ...
1590               1.849372              -0.167314   HGDP00555  Oceanian_Ancestry
1591               1.666233              -0.224749   HGDP00556  Oceanian_Ancestry
1592               1.983947              -0.202254   HGDP00552  Oceanian_Ancestry
1593               2.202948              -0.210858   HGDP00554  Oceanian_Ancestry
1594              -4.693172             126.672265      Sample             Sample

The code that I use:

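import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def do_pca(pca_data, sample_name, pops):
    """
    This function plots the PCA data from the sample and dataset in a PCA plot
    """
    
    # initialize variables for the PCA plot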
    pops  = pops + ["Sample"]
    pca_df = pd.read_csv(pca_data, sep=";")
    pca_df = pca_df[pca_df["Population"].isin(pops)].reset_index()
    features = list(pca_df.columns.values)
    features.remove("Population")
    features.remove("Sample_name")
    x = pca_df.loc[:, features].values # Separating out the features
    y = pca_df.loc[:, ["Population", "Sample_name"]] # Separating out the target
    x = StandardScaler().fit_transform(x) # Standardizing the features

    # initialize PCA plot
    dot_size = 20
    pca = PCA(n_components=2)
    pc = pca.fit_transform(x)
    pc_df = pd.DataFrame(data=pc, columns=["principal_component_%s" % (i + 1) for i in range(2)])
    
    pc_df["Sample_name"] = y["Sample_name"]
    pc_df["Population"] = y["Population"]
    return pc_df

Can someone explain what I'm doing wrong? Is my code off?

I found a similar question on Stack Overflow, but it doesn't have an answer: link


Solution

  • try turning it off and on again :/
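
Two details in the question's code may be worth checking, although nothing in the question confirms that either one causes the outlier. First, `reset_index()` without `drop=True` keeps the old index as a new column named `index`. Since `features` is every column except `Population` and `Sample_name`, that column gets standardized and fed into the PCA as a feature. Second, `fit_transform` is called on all rows, so the "Sample" row itself influences both the scaling and the components. The following is a minimal sketch of an alternative: fit on the reference populations only, then project every row into that space. `project_sample`, its parameters, and the demo data are hypothetical and not part of the asker's code; only the column names are taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_sample(pca_df, pops, sample_label="Sample", n_components=2):
    """Fit scaling and PCA on the reference rows, then project every row.

    Hypothetical helper: column names follow the question, the rest is assumed.
    """
    keep = pca_df["Population"].isin(pops + [sample_label])
    # drop=True discards the old index instead of adding an "index" column
    # that would otherwise be picked up as a feature.
    df = pca_df[keep].reset_index(drop=True)

    features = [c for c in df.columns if c not in ("Population", "Sample_name")]
    is_ref = (df["Population"] != sample_label).to_numpy()
    x = df[features].to_numpy(dtype=float)

    # Learn the mean/variance and the components from the reference rows only.
    scaler = StandardScaler().fit(x[is_ref])
    pca = PCA(n_components=n_components).fit(scaler.transform(x[is_ref]))

    # Project all rows (reference and sample) into that fixed space.
    pc = pca.transform(scaler.transform(x))

    columns = [f"principal_component_{i + 1}" for i in range(n_components)]
    pc_df = pd.DataFrame(pc, columns=columns)
    pc_df["Sample_name"] = df["Sample_name"]
    pc_df["Population"] = df["Population"]
    return pc_df


# Made-up demo data: six reference rows in two populations plus one sample.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(7, 3)), columns=["f1", "f2", "f3"])
demo["Sample_name"] = [f"ref{i}" for i in range(6)] + ["Sample"]
demo["Population"] = ["A"] * 3 + ["B"] * 3 + ["Sample"]

# Keep only population "A" plus the sample, as in the second and third plots.
result = project_sample(demo, ["A"])
print(result)
```

With this approach the axes are defined by the reference panel alone, so removing populations changes the space the sample is projected into, but the sample can no longer pull a component toward itself. If the sample still lands far from every population, its feature values are probably worth inspecting directly.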