I am using KMeans from sklearn to cluster College.csv, but after I fit the KMeans model my dataset has changed! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to dummy-encode the categorical variable "Private".
My code is:
num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])
ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1,1)).toarray()
km = KMeans(n_clusters=6)
km.fit(data)
There's a subtle bug in the posted code. Let's demonstrate it:
new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
OneHotEncoder returns something like this:
new_data = np.array(
[[0, 1],
[0, 1],
[1, 0]])
What happens if we assign new_df["Private"] with our new (3, 2) array?
>>> new_df["Private"] = new_data
>>> print(new_df)
   Private
0        0
1        0
2        1
Wait, where'd the other column go?
Uh oh, it's still in there ...
... but it's invisible until we look at the actual values:
>>> print(new_df.values)
[[0 1]
 [0 1]
 [1 0]]
As @Derek hinted in his answer, KMeans has to validate the data, which usually converts pandas DataFrames into their underlying arrays. When this happens, every column after "Private" appears shifted one position to the right, because the two-column OneHotEncoder output left an invisible extra column behind.
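If you want to keep the manual approach, one minimal fix is to assign only a single 1-D indicator column. The sketch below reuses the toy array from the demonstration and assumes the second column is the "Yes" indicator, which matches the example output above (OneHotEncoder sorts categories, so "No" comes first). The name encoded is just for illustration.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})

# Stand-in for the OneHotEncoder output; assumed column order: [is_No, is_Yes].
encoded = np.array(
    [[0, 1],
     [0, 1],
     [1, 0]])

# Assign one 1-D column instead of the whole (3, 2) array.
new_df["Private"] = encoded[:, 1]

# The frame and its underlying array now agree on the shape.
print(new_df.shape)
print(new_df.values.shape)
```

For a binary variable a single 0/1 column carries the same information as the two one-hot columns, so nothing is lost by dropping one of them.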
Is there a better way?
Yep, use a pipeline!
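from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            # "Private" is binary, so a single 0/1 column is enough.
            ("ohe", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        # Every other column is standardized.
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=6),
)
# df is the original College DataFrame, before any manual scaling or encoding.
out = pipe.fit(df)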
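Here is a small self-contained sketch of the same pipeline idea, in case you want to try it without College.csv. The data values are made up, the step label "private" is my own, and n_clusters=2 is used only because the toy frame has six rows. The step names "columntransformer" and "kmeans" are the lowercase class names that make_pipeline assigns automatically.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up stand-in for College.csv: one categorical and two numeric columns.
df = pd.DataFrame({
    "Private": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Apps": [1660, 2186, 1428, 417, 193, 587],
    "Accept": [1232, 1924, 1097, 349, 146, 479],
})

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("private", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        remainder=StandardScaler(),
    ),
    # n_init and random_state are set explicitly so repeated runs agree.
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe.fit(df)

# The matrix KMeans actually saw: the encoded column first, then the scaled ones.
features = pipe.named_steps["columntransformer"].transform(df)
labels = pipe.named_steps["kmeans"].labels_

print(features.shape)          # same number of columns as df
print(labels)                  # one cluster label per row
print(df["Private"].tolist())  # the input frame still holds the original strings
```

Because the transformations happen inside the pipeline, the original DataFrame is never overwritten, which also addresses the "my dataset changes" part of the question.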