python scikit-learn cluster-analysis k-means

Why does sklearn KMeans change my dataset after fitting?


I am using KMeans from sklearn to cluster the College.csv dataset, but after I fit the KMeans model, my dataset has changed! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to create dummy variables for the categorical variable "Private".

My code is:

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])

ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1,1)).toarray()

km = KMeans(n_clusters=6)
km.fit(data)

The dataset before using KMeans: (image not available)

The dataset after using KMeans: (image not available)


Solution

  • There's a subtle bug in the posted code. Let's demonstrate it:

    new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
    

    OneHotEncoder returns something like this:

    new_data = np.array(
        [[0, 1],
         [0, 1],
         [1, 0]])
    

    What happens if we assign our new (3, 2) array to new_df["Private"]?

    >>> new_df["Private"] = new_data
    >>> print(new_df)
       Private
    0        0
    1        0
    2        1
    

    Wait, where'd the other column go?

    Uh oh, it's still in there ...

    ... but it's invisible until we look at the actual values:

    >>> print(new_df.values)
    [[0 1]
     [0 1]
     [1 0]]
    

    As @Derek hinted in his answer, KMeans has to validate its input, which usually means converting the pandas DataFrame into its underlying array. When that happens, the hidden second column created by the OneHotEncoder reappears, so every column after "Private" ends up shifted one position to the right.


    Is there a better way?

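    Yep, use a pipeline! A ColumnTransformer applies each preprocessing step to the right columns and leaves the original dataframe untouched.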

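    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OrdinalEncoder, StandardScaler

    pipe = make_pipeline(
        ColumnTransformer(
            transformers=[
                # "Private" is binary, so a single 0/1 column is enough.
                ("ordinal", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
            ],
            # Every other column is standardized.
            remainder=StandardScaler(),
        ),
        KMeans(n_clusters=6),
    )

    # df is the original, untransformed dataframe.
    out = pipe.fit(df)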
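
To see which cluster each row landed in, the fitted pipeline can be queried directly. The following is a rough sketch on a made-up frame, not the College.csv data: the `Apps` and `Accept` columns, their values, `n_clusters=2`, and the `n_init`/`random_state` settings are all assumptions chosen to keep the example small and repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Invented data: two small private schools and two large public ones.
toy = pd.DataFrame({
    "Private": ["Yes", "Yes", "No", "No"],
    "Apps": [1660, 2186, 9000, 8500],
    "Accept": [1232, 1924, 7000, 6800],
})
original = toy.copy()  # kept only to check that fitting does not alter toy

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe.fit(toy)

# make_pipeline names each step after its lowercased class name.
labels = pipe.named_steps["kmeans"].labels_

# The input frame is the same as before fitting.
unchanged = toy.equals(original)
```

Because all preprocessing happens inside the pipeline, the dataframe you pass in stays as it was, and `labels` lines up row for row with it.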