Search code examples
machine-learningcluster-analysisk-means

How do you make a KMeans prediction more accurate?


I'm learning about clustering and KMeans and such, so my knowldge is very basic on the topic. What I have below is a bit of a self study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted to my own data.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

my constructed data

dataset = [
    [0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
    [0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
    [0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
    [1,'a','r','e'],[0,'g','c','c']     
]

df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()

df:

Binary  Col1  Col2  Col3
------------------------
  1       a    b     c
  0       x    t     v
  0       s    q     w
  1       n    m     a
  1       u    a     r

Encode non binary to binary:

labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])

labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])

labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])

Set clusters to two, because its either 1 or 0?

X = np.array(df.drop(['Binary'], 1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

Test it:

correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

The result:

print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%

How can I get it more accurate to the point where it 99.99% knows that 'a' means binary column is 1? More data?


Solution

  • K-means does not even try to predict this value. Because it is an unsupervised method. Because it is not a prediction algorithm; it is a structure discovery task. Don't mistake clustering for classification.

    The cluster numbers have no meaning. They are 0 and 1 because these are the first two integers. K-means is randomized. Run it a few times and you will also score just 29% sometimes.

    Also, k-means is designed for continuous input. You can apply it on binary encoded data, but the results will be pretty poor.