
Recover column details after PCA and KMeans


I ran KMeans clustering after using PCA to reduce the numerical columns in my DataFrame from 5 to 2, then drew a scatterplot of the result.

pc=PCA(n_components = 2).fit_transform(scaled_df)
scaled_df_PCA= pd.DataFrame(pc, columns=['pca_col1','pca_col2'])


# Then I ran KMeans and plotted the clusters

label_PCA=final_km.fit_predict(scaled_df_PCA)
scaled_df_PCA["label_PCA_df"]=label_PCA

a=scaled_df_PCA[scaled_df_PCA.label_PCA_df==0]
b=scaled_df_PCA[scaled_df_PCA.label_PCA_df==1]
c=scaled_df_PCA[scaled_df_PCA.label_PCA_df==2]

# Pass x and y as keywords; recent seaborn versions reject positional x/y
sns.scatterplot(x=a.pca_col1, y=a.pca_col2, color="green")
sns.scatterplot(x=b.pca_col1, y=b.pca_col2, color="red")
sns.scatterplot(x=c.pca_col1, y=c.pca_col2, color="yellow")

This gives me 3 clusters based on the 2 PCA components. Now I want to get the original columns back so I can analyze those clusters further, but I can't. When I use pc.components_ I get this error:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_33/4073743739.py in <module>
----> 1 pc.components_

AttributeError: 'numpy.ndarray' object has no attribute 'components_'

Or, when I try scaled_df_PCA.components_:
AttributeError: 'DataFrame' object has no attribute 'components_'

How can I recover the details of the original columns that were reduced by PCA?


Solution

  • This line from your code stores the NumPy ndarray returned by fit_transform() in pc, not the PCA instance, which is why pc has no components_ attribute.

    pc=PCA(n_components = 2).fit_transform(scaled_df)
    

    An easy fix is to create the PCA instance first and then call fit_transform().

    pca = PCA(n_components=2)
    df_transformed = pca.fit_transform(scaled_df)
    

    Afterwards, the fitted PCA instance, pca, is still available, so you can access its attributes and methods, for example pca.components_ or pca.inverse_transform().
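
To make the fix concrete, here is a minimal sketch of how the fitted pca instance could be used to relate the clusters back to the original columns. The input data is a hypothetical stand-in (random values with made-up column names f1..f5), and n_clusters=3 is assumed from the question since the definition of final_km is not shown. Treat it as an illustration, not your exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the original 5-column numeric DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(150, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Keep the fitted PCA instance so its attributes stay available.
pca = PCA(n_components=2)
pc = pca.fit_transform(scaled_df)
scaled_df_PCA = pd.DataFrame(pc, columns=["pca_col1", "pca_col2"])

# Loadings: one row per component, one column per original feature.
loadings = pd.DataFrame(
    pca.components_, columns=scaled_df.columns, index=scaled_df_PCA.columns
)

# Cluster on the two components (3 clusters assumed, as in the question).
final_km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = final_km.fit_predict(scaled_df_PCA)

# PCA keeps the row order, so labels can be attached to the original columns.
clustered = df.assign(label_PCA_df=labels)
cluster_means = clustered.groupby("label_PCA_df").mean()

# Approximate reconstruction of the scaled columns from the 2 components.
reconstructed = pd.DataFrame(pca.inverse_transform(pc), columns=scaled_df.columns)

print(loadings)
print(cluster_means)
```

Note that PCA does not keep the original columns; each component is a weighted mix of all five. The loadings table shows those weights, inverse_transform() only gives an approximation of the scaled data, and for per-cluster analysis it is usually simplest to attach the labels to the original DataFrame as shown.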