python, machine-learning, k-means

K-Means not resulting in elbow shape


I'm trying to run k-means on the Bank Marketing dataset (available at the link in the code below), using only the variables about the client. The problem is that 6 of the 7 variables are categorical, so I applied a one-hot encoder to them.

To choose the number of clusters with the elbow method, I ran KMeans for 2 to 22 clusters and plotted the inertia_ values. But the curve looked nothing like an elbow; it was closer to a straight line.

Am I doing something wrong?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans 
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

bank = pd.read_csv('bank-additional-full.csv', sep=';') #available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing# 

# 1. selecting only the information about the client
cli_vars = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
bank_cli = bank[cli_vars].copy()

#2. applying one hot encoder to categorical variables
X = bank_cli[['job', 'marital', 'education', 'default', 'housing', 'loan']]
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
X_2.values
enc = preprocessing.OneHotEncoder()
enc.fit(X_2)

one_hot_labels = enc.transform(X_2).toarray()
one_hot_labels.shape #(41188, 33)

#3. concatenating numeric and categorical variables
X = np.concatenate((bank_cli.values[:,0].reshape((-1,1)),one_hot_labels), axis = 1)
X.shape

X = X.astype(float)
X_fit = StandardScaler().fit_transform(X)

X_fit

#4. function to calculate k-means for 2 to 22 clusters
def calcular_cotovelo(data):
    wcss = []
    for i in range(2, 23):
        kmeans = KMeans(init = 'k-means++', n_init= 12, n_clusters = i)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    return wcss

cotovelo = calcular_cotovelo(X_fit)

#5. plot to see the elbow to select the ideal number of clusters
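plt.plot(range(2, 23), cotovelo)  # x-axis: number of clusters actually tried
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()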

This is the inertia plot I used to choose the number of clusters. It has no elbow shape, and the values are very high.

[Image: inertia plotted against the number of clusters]


Solution

  • K-means is not suited to categorical data. Look at k-prototypes instead, which combines k-modes and k-means and can cluster mixed numerical and categorical data.

    An implementation of k-prototypes is available in Python.

    If you consider only the numerical variable, however, you can see an elbow in the k-means criterion:

    [Image: k-means on numerical data only]

    To understand why no elbow appears with k-means on the combined numerical and categorical data, look at the number of points per cluster. Each time you increase the number of clusters, the new cluster contains only a few points that belonged to a large cluster at the previous step, so the criterion drops only slightly from one step to the next.
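The points-per-cluster check described above can be sketched as follows. This is a minimal illustration, not the answerer's original code: the helper name cluster_sizes_by_k and the synthetic X_demo array (one numeric column plus two one-hot encoded categorical columns) are assumptions made so the snippet runs without the CSV. The exact sizes and inertia values depend on the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_sizes_by_k(data, k_values, random_state=0):
    """Fit k-means for each k; return {k: (inertia, cluster sizes, largest first)}."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=random_state)
        labels = km.fit_predict(data)
        # Count how many points were assigned to each cluster label.
        sizes = np.bincount(labels, minlength=k)
        results[k] = (km.inertia_, sorted(sizes.tolist(), reverse=True))
    return results


# Synthetic stand-in for the question's data (an assumption, not the real dataset):
# one numeric column and two categorical columns, one-hot encoded via identity rows.
rng = np.random.default_rng(0)
n = 2000
age = rng.normal(40, 10, size=(n, 1))
cat_a = np.eye(5)[rng.integers(0, 5, size=n)]
cat_b = np.eye(4)[rng.integers(0, 4, size=n)]
X_demo = StandardScaler().fit_transform(np.hstack([age, cat_a, cat_b]))

results = cluster_sizes_by_k(X_demo, range(2, 9))
for k, (inertia, sizes) in results.items():
    print(f"k={k}: inertia={inertia:.0f}, sizes={sizes}")
```

With the question's own variables, the equivalent call would be cluster_sizes_by_k(X_fit, range(2, 23)). Comparing the size lists for consecutive values of k shows whether each new cluster is just a small group split off from a large one.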