Hi, I have a theoretical question about code that works fine.
I am running a PCA on the load_breast_cancer dataset from sklearn. After running the PCA I plot the data on the first two principal components, and I know I can color the points by a key from the original load_breast_cancer dataset, namely 'target'.
The part I am particularly concerned about is the plot call, where I write "c=cancer['target']". How is the 'target' column retained through all of the scaling and PCA, especially since x_pca is a numpy.ndarray with shape (569, 2)?
Code below:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
#importing dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
# scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
#Plotting
plt.figure(figsize=(8,6))
# Note that x_pca is a NumPy array, not a DataFrame, so the brackets select columns by position
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='plasma')
plt.xlabel('First PC')
plt.ylabel('Second PC')
Thank you!
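It seems that you run df through a pipeline, and df does not include target as a column, so target is never transformed in the process. It is not retained inside x_pca at all: it still lives unchanged in cancer['target']. Scaling and PCA change the columns but keep the rows in the same order, so row i of x_pca is still sample i of the original data, and cancer['target'][i] is still its label. The c= argument only needs an array with one value per plotted point (569 here), which is why the coloring lines up.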
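To make the row alignment concrete, here is a minimal sketch that repeats the same scaling and PCA steps and checks that the labels and the projected points still match up row by row. The names X, y, and i are only illustrative and do not appear in the question's code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = cancer['data']      # features only, shape (569, 30)
y = cancer['target']    # labels, kept separately, shape (569,)

# Same steps as in the question: scale, then project onto 2 components
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

# One row per sample before and after: only the number of columns changed
print(X.shape, y.shape, x_pca.shape)

# Row i of x_pca is sample i of the original data: pushing that single
# sample through the fitted scaler and PCA reproduces the same point.
i = 10  # arbitrary sample index, chosen for illustration
single = pca.transform(scaler.transform(X[[i]]))
print(np.allclose(single[0], x_pca[i]))
```

Because nothing in the pipeline reorders or drops rows, y[i] is still the label of the point x_pca[i], and passing the untouched labels to c= colors each point correctly.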