python machine-learning scikit-learn svm k-means

ValueError: Unknown label type: 'continuous' when using clustering + classification models together

I created a clustering model to find different groups of customers based on annual income and spending score, using the KMeans algorithm from Scikit-Learn. Using the cluster value it returned for each customer as the label, I then tried to create a classification model using Support Vector Classification from sklearn.svm. When I tried to fit the new model to the dataset, however, I got this error message:

File "/Users/user/Documents/Machine Learning A-Z Template Folder/Part 4 - Clustering/Section 24 - K-Means Clustering/cluster_and_prediction.py", line 28, in <module>
    classifier.fit(x_train, y_train)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 149, in fit
    y = self._validate_targets(y)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 525, in _validate_targets
    check_classification_targets(y)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 169, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

My code is as follows:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Using relevant columns from dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, 3:5].values

# Creating model with ideal amount of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(x)

predictions = kmeans.predict(x)

# Creating numpy array for feature scaling
predictions = np.array(predictions, dtype=int)
predictions = predictions[:, None]

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
predictions = sc_y.fit_transform(predictions)

# Splitting dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, predictions, test_size=.25)

# Creating Support Vector Classification model
from sklearn.svm import SVC
classifier = SVC(kernel='rbf')
classifier.fit(x_train, y_train)

[Image: Elbow Model Used for Clustering]

[Image: Clustering Visualization]

.zip file with the dataset (the dataset is called 'Mall_Customers.csv')

How can I fix this?

Solution

Since you are treating this as a classification problem with 5 classes, you should not scale your labels. StandardScaler converts the integer cluster labels into continuous floating-point values, which a classification model cannot accept as class labels, hence the error.
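As a quick check of this explanation, the sketch below uses type_of_target from sklearn.utils.multiclass, the helper that check_classification_targets (seen in the traceback) relies on. The small label array is a made-up stand-in for the k-means output, not the actual data from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.multiclass import type_of_target

# Hypothetical cluster labels standing in for kmeans.predict(x)
labels = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])

# Integer labels are recognised as classification targets
type_before = type_of_target(labels)

# Scaling turns them into non-integer floats, as in the question's code
scaled = StandardScaler().fit_transform(labels[:, None])
type_after = type_of_target(scaled)

print(type_before)  # multiclass
print(type_after)   # continuous
```

The second result is the same 'continuous' label type that appears in the error message.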

Also, unrelated to the error: the correct methodology is to fit your scaler on the training data only, and then use this fitted scaler to transform the test data. Otherwise, information from the test set leaks into the preprocessing.

So, here are the necessary changes (to be applied right after you have set your predictions variable):

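from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Initial (unscaled) x used here; ravel() flattens the (n, 1) label column to 1-D
x_train, x_test, y_train, y_test = train_test_split(x, predictions.ravel(), test_size=.25)

# Fit the scaler on the training features only, then reuse it for the test features
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)

classifier = SVC(kernel='rbf')
classifier.fit(x_train_scaled, y_train)  # no scaling for predictions or y_train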

Also unrelated to the error: you should scale your x data before running k-means, i.e. scale x first and then perform the clustering. K-means relies on Euclidean distances, so a feature on a larger scale would otherwise dominate the result. I am leaving this as an exercise, as it has nothing to do with the error.
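For readers who want to compare notes on that exercise, here is one possible sketch of the whole flow. It is not the only reasonable arrangement. Since Mall_Customers.csv is not available here, the features are synthetic blobs from make_blobs, which is purely an assumption for illustration; the random_state passed to train_test_split is also added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the income / spending-score columns (not the real data)
x, _ = make_blobs(n_samples=200, n_features=2, centers=5, random_state=0)

# 1. Scale first, then cluster on the scaled features
x_scaled = StandardScaler().fit_transform(x)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(x_scaled)  # 1-D integer labels, never scaled

# 2. Split the raw features and labels, then scale using training statistics only
x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=.25, random_state=0)
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)

# 3. Fit the classifier on the integer cluster labels
classifier = SVC(kernel='rbf')
classifier.fit(x_train_scaled, y_train)
accuracy = classifier.score(x_test_scaled, y_test)
print(accuracy)
```

The scaler used for clustering and the one used for classification are kept separate here, so that the classifier's preprocessing is still fitted on the training split only.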