python machine-learning scikit-learn k-means feature-clustering

How to inverse-transform data after clustering

I want to recover my original data after running K-means clustering on a dataset scaled with MinMaxScaler. Here is a sample of my code:

copy_df = scaled_df.copy()
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(features)
copy_df['Cluster'] = kmeans.predict(features)

The scaler was saved; I tried something like: x = scaler.inverse_transform(x)

My copy_df has one more column than scaled_df (the cluster number).

I guess that's why I'm getting:

ValueError: operands could not be broadcast together with shapes (3,5) (4,) (3,5)

How could I recover my data?

I need the original (unscaled) values for each cluster, or the mean of each feature per cluster.

Solution

There is a mismatch between the shape MinMaxScaler expects (the number of columns it was fitted on) and the shape you passed after clustering, which has one extra column holding the cluster membership. You can either assign the cluster labels directly to the original data or, if you really need the inverse, apply inverse_transform to the scaled feature columns only and then add the cluster labels to the result. Both approaches produce the same dataframe.

# Import the packages
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load the data
data = pd.DataFrame(load_iris()['data'])

# Initialize a scaler
scaler = MinMaxScaler()

# Perform scaling
data_scaled = pd.DataFrame(scaler.fit_transform(data))

# Initialize KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)

# Obtain the clusters
clusters = kmeans.fit_predict(data_scaled)

# Add the cluster labels to the original data
data['clusters'] = clusters

# Invert the scaling on the feature columns only, then add the cluster labels as a new column
data_invscaled = pd.DataFrame(scaler.inverse_transform(data_scaled.iloc[:, 0:4]))
data_invscaled['clusters'] = clusters

# Check that the two dataframes are equal: assert_frame_equal raises an
# AssertionError if they differ and returns None (printed here) if they match
print(pd.testing.assert_frame_equal(data, data_invscaled, check_dtype=False))
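
Since the question also asks for the mean of each feature per cluster, here is a minimal sketch of one way to get it in the original units. It assumes the frame names from the question (scaled_df, copy_df); the column names in feature_cols are hypothetical and only used for illustration, and n_init=10 is set explicitly as an assumption rather than taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column names, chosen only for this illustration
feature_cols = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
data = pd.DataFrame(load_iris()["data"], columns=feature_cols)

# Scale the features and cluster on the scaled values
scaler = MinMaxScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(data), columns=feature_cols)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
copy_df = scaled_df.copy()
copy_df["Cluster"] = kmeans.fit_predict(scaled_df)

# Inverse-transform only the columns the scaler was fitted on,
# then re-attach the cluster labels
recovered = pd.DataFrame(
    scaler.inverse_transform(copy_df[feature_cols]),
    columns=feature_cols,
    index=copy_df.index,
)
recovered["Cluster"] = copy_df["Cluster"]

# Mean of each feature per cluster, in the original units
cluster_means = recovered.groupby("Cluster")[feature_cols].mean()

# The fitted centroids mapped back to the original units
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=feature_cols,
)

print(cluster_means)
print(centers_original)
```

Because min-max scaling is an affine map per feature, the back-transformed centroids should be close to the per-cluster means when K-means has converged, although scikit-learn does not guarantee that cluster_centers_ and the labels agree exactly if the fit stops early.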