python, machine-learning, scikit-learn, k-means, feature-clustering

How to apply inverse_transform after clustering


I want to recover my original data after running K-means clustering on a dataset scaled with MinMaxScaler. Here is a sample of my code:

    copy_df = scaled_df.copy()
    kmeans = KMeans(n_clusters=3, random_state=42)
    kmeans.fit(features)
    copy_df['Cluster'] = kmeans.predict(features)

The scaler was saved; I tried something like: x = scaler.inverse_transform(x)

My copy_df should have one more column than my scaled_df (the Cluster number).

I guess that's why I'm getting:

ValueError: operands could not be broadcast together with shapes (3,5) (4,) (3,5) 

How could I recover my data?

I need to get the real data of my clusters or the mean of each feature.


Solution

  • The shape that MinMaxScaler expects (based on the fit) does not match what you passed after clustering, which has one extra column: the cluster membership. You can either assign the cluster labels directly to the original data or, if you really need the inverse, call inverse_transform on the scaled feature columns only and then add the cluster labels to the result. Both approaches produce the same dataframe.

    # Import the packages
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans
    
    # Load the data
    data = pd.DataFrame(load_iris()['data'])
    
    # Initialize a scaler
    scaler = MinMaxScaler()
    
    # Perform scaling
    data_scaled = pd.DataFrame(scaler.fit_transform(data))
    
    # Initialize KMeans clustering
    kmeans = KMeans(n_clusters=3, random_state=42)
    
    # Obtain the clusters
    clusters = kmeans.fit_predict(data_scaled)
    
    # Add the cluster labels to the original data
    data['clusters'] = clusters
    

    OR

    # Inverse the scaling and add the cluster labels as a new column
    data_invscaled = pd.DataFrame(scaler.inverse_transform(data_scaled.iloc[:, 0:4]))
    data_invscaled['clusters'] = clusters
    
    # Check whether the two dfs are equal: assert_frame_equal raises an
    # AssertionError if they differ and returns None otherwise, so None means equal
    print(pd.testing.assert_frame_equal(data, data_invscaled, check_dtype=False))
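
The question also asks for the mean of each feature per cluster. A minimal sketch of one way to get that, assuming the same iris setup as above: the names scaled_with_labels, recovered, cluster_means and centers_original are illustrative, and n_init=10 is passed explicitly only so the run does not depend on the library's version-specific default. The label column is dropped before calling inverse_transform, so the scaler only sees the four columns it was fitted on.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Same setup as above: scale the iris features, then cluster the scaled values
data = pd.DataFrame(load_iris()['data'])
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(data_scaled)

# A scaled frame that already carries the labels, like copy_df in the question
scaled_with_labels = data_scaled.copy()
scaled_with_labels['Cluster'] = labels

# Drop the label column so the scaler receives only the 4 fitted features
feature_part = scaled_with_labels.drop(columns='Cluster')
recovered = pd.DataFrame(
    scaler.inverse_transform(feature_part.to_numpy()),
    index=scaled_with_labels.index,
)
recovered['Cluster'] = scaled_with_labels['Cluster']

# Mean of each feature per cluster, in the original units
cluster_means = recovered.groupby('Cluster').mean()

# The fitted centroids mapped back to the original units
centers_original = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_))

print(cluster_means)
print(centers_original)
```

Because min-max scaling is a per-feature linear map, the two printed tables should be close to each other, though they are not guaranteed to be identical if K-means stopped on its tolerance rather than at exact convergence.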