I am exploring PCA in Scikit-learn (0.20 on Python 3) using Pandas for structuring my data. When I apply a test/train split (and only when), my input labels seem to no longer match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn
def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df
feature_names, bc_data = load_bc_as_dataframe()
from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']
seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data line with a train_test_split() call (even with test_size=0), my labels seem to no longer correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which frame.
How can I correctly relabel my PCA output?
The issue has three parts:
1. train_test_split() shuffles the rows, so the indices in bc_train are in a random order (compared to the row location).
2. Constructing the new DataFrame from the PCA output recreates the indices to be sequential (compared to the row location).
3. This leaves shuffled indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], bc_train['diagnosis'] is reindexed with bc_pca's indices. This reorders the bc_train data so that its indices match bc_pca's.
To put it another way, Pandas does a left-join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
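The alignment behaviour described above can be reproduced on a toy example. This is only a minimal sketch: the frames `shuffled` and `projected` and their contents are made up for illustration, standing in for bc_train and bc_pca.

```python
import pandas as pd

# Hypothetical stand-in for bc_train after shuffling: the index labels
# (2, 0, 1) no longer match the row positions (0, 1, 2).
shuffled = pd.DataFrame({'diagnosis': ['c', 'a', 'b']}, index=[2, 0, 1])

# Hypothetical stand-in for bc_pca: a freshly constructed frame gets a
# default sequential index (0, 1, 2).
projected = pd.DataFrame({'PCA 1': [10.0, 20.0, 30.0]})

# Assigning a Series aligns on index labels, not on row position.
projected['by_label'] = shuffled['diagnosis']

# Assigning the underlying array copies row by row instead.
projected['by_position'] = shuffled['diagnosis'].values

print(projected)
```

The by_label column comes out in index order ('a', 'b', 'c'), while by_position keeps the shuffled row order ('c', 'a', 'b'), which is the order the projected rows are actually in.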
I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.
There are a number of ways to avoid this. I can reset the index of the training/test data:
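bc_train, _ = train_test_split(bc_data, test_size=0)
# drop=True discards the shuffled index instead of keeping it as an extra 'index' column
bc_train.reset_index(drop=True, inplace=True)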
Alternatively I could assign from the values member:
bc_pca['diagnosis'] = bc_train['diagnosis'].values
I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
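A closely related variation, which is not one of the options listed above, is to hand the shuffled index to the new DataFrame when constructing it, so that a later label-aligned assignment lines up. The sketch below uses made-up data, and a simple array operation stands in for pca.fit_transform(); the names `train`, `raw` and `projected` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the shuffled training frame.
train = pd.DataFrame(
    {'f1': [1.0, 2.0, 3.0], 'f2': [4.0, 5.0, 6.0], 'diagnosis': ['c', 'a', 'b']},
    index=[2, 0, 1],
)

# Stand-in for pca.fit_transform(...): any operation that returns a plain
# NumPy array with one row per input row, in the same row order.
raw = train[['f1', 'f2']].values * 2.0

# Reuse the training index so that label-aligned assignment matches rows.
projected = pd.DataFrame(raw, columns=['PCA 1', 'PCA 2'], index=train.index)
projected['diagnosis'] = train['diagnosis']

print(projected)
```

Because both frames now share the same index, the assignment leaves each diagnosis on the row it was computed from, and the original row labels stay available for tracing points back to bc_data.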