Tags: python, pandas, scikit-learn, pca

Why does sklearn's train/test split plus PCA make my labelling incorrect?


I am exploring PCA in Scikit-learn (0.20 on Python 3) using Pandas for structuring my data. When I apply a test/train split (and only when), my input labels seem to no longer match up with the PCA output.

import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn

def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df

feature_names, bc_data = load_bc_as_dataframe()

from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']

seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)

pyplot.show()

(Scatter plot of PCA 1 vs PCA 2, coloured and styled by diagnosis.)

This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data assignment with a train_test_split() call (even with test_size=0), my labels no longer seem to correspond to the original ones.

(The same scatter plot after train_test_split: the diagnosis labels no longer line up with the clusters.)

I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which row.

How can I correctly relabel my PCA output?


Solution

  • The issue has three parts:

    1. The shuffling in train_test_split() causes the indices in bc_train to be in a random order (compared to the row location).
    2. PCA operates on numerical matrices, and effectively strips the indices from the input. Creating a new DataFrame recreates the indices to be sequential (compared to the row location).
    3. Now we have random indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], the right-hand Series is aligned to bc_pca's indices. This reorders the bc_train labels so that their indices match bc_pca's.

    To put it another way, Pandas does a left-join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
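    The alignment behaviour is easy to see in isolation with a toy frame (a sketch; the names here are made up for illustration):

    import pandas as pd

    left = pd.DataFrame({'a': [10, 20, 30]})             # fresh 0, 1, 2 index
    right = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])  # shuffled index

    # __setitem__ aligns on index labels, not on row positions:
    left['label'] = right

    print(list(left['label']))  # ['y', 'z', 'x'], not ['x', 'y', 'z']

    The values land next to the rows whose index labels match, which is exactly the silent reordering seen above.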

    I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.

    There are a number of ways to avoid this. I can reset the index of the training/test data:

    bc_train, _ = train_test_split(bc_data, test_size=0)
    bc_train.reset_index(drop=True, inplace=True)  # drop=True avoids keeping the old index as a column
    

    Alternatively I could assign from the values member:

    bc_pca['diagnosis'] = bc_train['diagnosis'].values
    

    I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
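    To illustrate that last variant end-to-end (a sketch with synthetic data; a plain NumPy projection stands in for PCA, since the alignment problem is independent of the transform itself):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(6, 4)), columns=list('abcd'))
    df['diagnosis'] = ['benign', 'malignant'] * 3

    # Simulate train_test_split's shuffle: an out-of-order index.
    train = df.sample(frac=1, random_state=1)

    # Reset the index *before* building the projected frame, so both
    # frames carry the same sequential 0..n-1 index.
    train = train.reset_index(drop=True)

    proj = pd.DataFrame(train[list('abcd')].values @ rng.normal(size=(4, 2)),
                        columns=('PCA 1', 'PCA 2'))
    proj['diagnosis'] = train['diagnosis']  # now aligns correctly

    assert (proj['diagnosis'] == train['diagnosis']).all()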