I am using K-means clustering on a dataset with some categorical features. I have some old code that operated on non-categorical data, and there the sequence of fit followed by predict works as expected.
So now I am modifying that working code to handle a dataset that has some categorical features, which requires one-hot encoding. This is where everything goes a bit pear-shaped.
It seems the predict method call is expecting the old number of columns from before one-hot encoding was carried out. After dropping the target column the dataset has 17 columns; after one-hot encoding it has 29 columns.
Here is my code:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from google.colab import drive
drive.mount('/gdrive')
#Change current working directory to gdrive
%cd /gdrive
#Read files
inputFileA = r'/gdrive/My Drive/FilenameA.csv'
trainDataA = pd.read_csv(inputFileA) #creates a dataframe
print(trainDataA.shape)
#Extract training and test data
print("------------------\nShapes before dropping target column")
print(trainDataA.shape)
y_trainA = trainDataA["Revenue"]
X_trainA = trainDataA.drop(["Revenue"], axis=1) #extracting training data without target column
print("------------------\nShapes after dropping target column")
print(X_trainA.shape)
#categorical features of dataset A
categoricalFeaturesA = ["Month", "VisitorType","Weekend"]
data_processed_A = pd.get_dummies(X_trainA,prefix_sep="__",columns=categoricalFeaturesA)
print("---------------\nDataset A\n",data_processed_A.head())
data_processed_A.to_csv(r'/gdrive/My Drive/data_processed_A.csv')
#K-Means Clustering ========================================================================
#Default Mode - K=8
kmeans = KMeans()
data_processed_A_fit = data_processed_A
print("===================")
print("Shape of processed data: \n", data_processed_A_fit.shape)
data_processed_A_fit.to_csv(r'/gdrive/My Drive/data_processed_A_after_fit.csv')
kmeans.fit(data_processed_A_fit)
print("Online shoppers dataset")
print("\n============\nDataset A labels")
print(kmeans.labels_)
print("==============\n\nDataset A Clusters")
print(kmeans.cluster_centers_)
#Print Silhouette measure
print("\nDataset A silhouette_score:",silhouette_score(data_processed_A, kmeans.labels_))
df_kmeansA = data_processed_A
print(df_kmeansA.head())
print(df_kmeansA.dtypes)
kmeans_predict_trainA = kmeans.predict(df_kmeansA)
It throws an error at the last line:
ValueError: Incorrect number of features. Got 30 features, expected 29
So it seems to be expecting the dataset prior to one hot encoding, but I can't figure out why.
EDIT: As requested, here are the outputs.
(12330, 18)
------------------
Shapes before dropping target column
(12330, 18)
------------------
Shapes after dropping target column
(12330, 17)
---------------
Dataset A
Administrative Administrative_Duration ... Weekend__False Weekend__True
0 0 0.0 ... 1 0
1 0 0.0 ... 1 0
2 0 0.0 ... 1 0
3 0 0.0 ... 1 0
4 0 0.0 ... 0 1
[5 rows x 29 columns]
===================
Shape of processed data:
(12330, 29)
Online shoppers dataset
============
Dataset A labels
[1 1 1 ... 1 1 1]
==============
Dataset A Clusters
[[ 3.81805930e+00 1.38862225e+02 9.64959569e-01 6.74071040e+01
5.82958221e+01 2.41869720e+03 7.61833487e-03 2.22516393e-02
8.26725184e+00 5.21563342e-02 2.12398922e+00 2.27021563e+00
3.19204852e+00 3.92318059e+00 3.70619946e-02 1.35444744e-01
4.71698113e-03 3.77358491e-02 1.95417790e-02 1.04447439e-01
2.58086253e-01 3.29514825e-01 4.38005391e-02 2.96495957e-02
6.13207547e-02 1.34770889e-03 9.37331536e-01 7.85040431e-01
2.14959569e-01]
[ 1.30855956e+00 4.07496939e+01 2.02343866e-01 1.04729641e+01
1.08400831e+01 2.55187795e+02 3.36811071e-02 5.91174011e-02
3.49331537e+00 6.83281412e-02 2.12030856e+00 2.36315087e+00
3.17015280e+00 4.23497997e+00 3.32294912e-02 1.38851802e-01
2.24002374e-02 3.85699451e-02 2.50704643e-02 1.79943629e-01
2.89126242e-01 1.90921228e-01 4.56905504e-02 3.61964100e-02
1.73713099e-01 1.00875241e-02 8.16199377e-01 7.76442664e-01
2.23557336e-01]
[ 6.91666667e+00 2.25307183e+02 2.61111111e+00 1.95093981e+02
2.78583333e+02 1.23142325e+04 5.09377058e-03 1.83440117e-02
4.99428623e+00 2.50000000e-02 2.06944444e+00 2.37500000e+00
2.48611111e+00 3.40277778e+00 4.16666667e-02 5.55555556e-02
-1.73472348e-17 2.77777778e-02 5.55555556e-02 2.77777778e-02
1.11111111e-01 6.38888889e-01 2.77777778e-02 1.38888889e-02
1.38888889e-02 1.30104261e-17 9.86111111e-01 7.36111111e-01
2.63888889e-01]
[ 1.10000000e+01 3.01400198e+03 1.50000000e+01 2.29990417e+03
5.77000000e+02 5.35723778e+04 2.80784550e-03 2.15663890e-02
3.81914478e-01 0.00000000e+00 2.00000000e+00 2.00000000e+00
1.00000000e+00 8.00000000e+00 0.00000000e+00 5.00000000e-01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
5.00000000e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 1.00000000e+00 5.00000000e-01
5.00000000e-01]
[ 4.86342229e+00 1.77626788e+02 1.34065934e+00 1.05527518e+02
9.92605965e+01 4.34306889e+03 6.80965548e-03 2.14584490e-02
8.25797938e+00 5.40031397e-02 2.15855573e+00 2.39089482e+00
3.01726845e+00 3.65934066e+00 3.61067504e-02 1.30298273e-01
3.13971743e-03 2.98273155e-02 1.25588697e-02 6.90737834e-02
1.97802198e-01 4.41130298e-01 4.55259027e-02 3.45368917e-02
2.66875981e-02 3.13971743e-03 9.70172684e-01 7.66091052e-01
2.33908948e-01]
[ 6.85123967e+00 2.23415936e+02 2.29338843e+00 1.93528478e+02
1.64049587e+02 7.41594639e+03 6.53738660e-03 2.02325121e-02
5.16682694e+00 3.63636364e-02 2.16115702e+00 2.28925620e+00
2.80991736e+00 3.44628099e+00 2.89256198e-02 9.09090909e-02
4.13223140e-03 4.54545455e-02 4.54545455e-02 5.37190083e-02
1.15702479e-01 5.28925620e-01 3.71900826e-02 4.95867769e-02
4.13223140e-03 4.13223140e-03 9.91735537e-01 7.68595041e-01
2.31404959e-01]
[ 2.74824952e+00 1.00268631e+02 5.53150859e-01 3.65903439e+01
3.26989179e+01 1.15670886e+03 9.20109170e-03 2.52780038e-02
9.51099189e+00 5.55060471e-02 2.12412476e+00 2.38415022e+00
3.15085933e+00 3.92520687e+00 3.81922342e-02 1.52450668e-01
7.32017823e-03 2.60980267e-02 2.13239975e-02 1.52768937e-01
2.76575430e-01 2.42838956e-01 4.32845321e-02 3.91470401e-02
1.31444940e-01 3.81922342e-03 8.64735837e-01 7.40292807e-01
2.59707193e-01]
[ 1.48000000e+01 1.12191581e+03 4.80000000e+00 6.74591667e+02
4.78400000e+02 2.32310689e+04 6.77737780e-03 2.03073056e-02
4.29149073e+00 -6.93889390e-18 1.90000000e+00 2.10000000e+00
1.70000000e+00 4.90000000e+00 1.00000000e-01 1.00000000e-01
3.46944695e-18 2.00000000e-01 3.46944695e-18 -2.77555756e-17
0.00000000e+00 4.00000000e-01 -6.93889390e-18 2.00000000e-01
2.77555756e-17 8.67361738e-19 1.00000000e+00 9.00000000e-01
1.00000000e-01]]
Dataset A silhouette_score: 0.564190293354119
Administrative Administrative_Duration ... Weekend__True Cluster Number
0 0 0.0 ... 0 1
1 0 0.0 ... 0 1
2 0 0.0 ... 0 1
3 0 0.0 ... 0 1
4 0 0.0 ... 1 1
[5 rows x 30 columns]
Administrative int64
Administrative_Duration float64
Informational int64
Informational_Duration float64
ProductRelated int64
ProductRelated_Duration float64
BounceRates float64
ExitRates float64
PageValues float64
SpecialDay float64
OperatingSystems int64
Browser int64
Region int64
TrafficType int64
Month__Aug uint8
Month__Dec uint8
Month__Feb uint8
Month__Jul uint8
Month__June uint8
Month__Mar uint8
Month__May uint8
Month__Nov uint8
Month__Oct uint8
Month__Sep uint8
VisitorType__New_Visitor uint8
VisitorType__Other uint8
VisitorType__Returning_Visitor uint8
Weekend__False uint8
Weekend__True uint8
Cluster Number int32
dtype: object
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-cf4258f963fa> in <module>()
3 print(df_kmeansA.head())
4 print(df_kmeansA.dtypes)
----> 5 kmeans_predict_trainA = kmeans.predict(df_kmeansA)
6 df_kmeansA['Cluster Number'] = kmeans_predict_trainA
7
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/_kmeans.py in _check_test_data(self, X)
815 raise ValueError("Incorrect number of features. "
816 "Got %d features, expected %d" % (
--> 817 n_features, expected_n_features))
818
819 return X
ValueError: Incorrect number of features. Got 30 features, expected 29
You wrote: "it seems to be expecting the dataset prior to one hot encoding"
It is not; if it were, it would ask for 17 features, not the 29 it actually expects:
ValueError: Incorrect number of features. Got 30 features, expected 29
So it complains about one feature more than expected. Looking closely at your printouts, the result of print(df_kmeansA.head()) is a printout of [5 rows x 30 columns] that contains a Cluster Number column. Nevertheless, your KMeans was fitted with data_processed_A_fit, which has
===================
Shape of processed data:
(12330, 29)
and no Cluster Number column.
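One quick way to confirm which column is the odd one out is to compare the column names used for fitting with those passed to predict. The following is only a minimal sketch on a tiny made-up frame; the names fitted, to_predict, extra and missing are hypothetical and the data is not from your CSV:

```python
import pandas as pd

# Hypothetical stand-in for the one-hot encoded frame the model was fitted on.
fitted = pd.DataFrame({
    "PageValues": [0.0, 1.5],
    "Weekend__False": [1, 0],
    "Weekend__True": [0, 1],
})
fit_columns = list(fitted.columns)  # remember the columns seen during fit

# Hypothetical frame passed to predict later, carrying one extra column.
to_predict = fitted.copy()
to_predict["Cluster Number"] = [1, 1]

# Columns present at predict time but not at fit time, and the reverse.
extra = [c for c in to_predict.columns if c not in fit_columns]
missing = [c for c in fit_columns if c not in to_predict.columns]
print("extra:", extra)
print("missing:", missing)
```

Storing the fitted column list right after fit makes this kind of check cheap to run before every predict call.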
It would certainly seem that, although you set data_processed_A_fit = data_processed_A and df_kmeansA = data_processed_A, there is a piece of code not shown here in which you add the Cluster Number column to the data_processed_A dataframe, hence the error.
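A likely candidate, judging only from the traceback, is line 6 of the failing cell (df_kmeansA['Cluster Number'] = kmeans_predict_trainA): plain assignment in Python does not copy a DataFrame, so df_kmeansA and data_processed_A are the same object, and if that cell ran successfully once, a second run would see 30 columns. The sketch below reproduces this on toy random data under that assumption; feature_columns, df_alias and df_result are hypothetical names, and n_clusters=2 is chosen only to keep the example small:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the one-hot encoded data (not the real dataset).
rng = np.random.default_rng(0)
data_processed = pd.DataFrame(rng.random((20, 3)), columns=["a", "b", "c"])

feature_columns = list(data_processed.columns)  # columns used for fitting
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data_processed)

# Plain assignment binds a second name to the SAME DataFrame; nothing is copied.
df_alias = data_processed
df_alias["Cluster Number"] = kmeans.predict(df_alias)  # first run: 3 columns, works

alias_is_same = df_alias is data_processed
original_has_cluster = "Cluster Number" in data_processed.columns

# Running the same predict again now passes 4 columns and raises ValueError.
try:
    kmeans.predict(df_alias)
    second_predict_failed = False
except ValueError:
    second_predict_failed = True

# One possible fix: predict on the fitted columns only, and store the labels
# in an independent copy so the feature frame is never modified.
labels = kmeans.predict(data_processed[feature_columns])
df_result = data_processed[feature_columns].copy()
df_result["Cluster Number"] = labels
```

Selecting feature_columns explicitly keeps predict safe even when the cell is re-run, and .copy() keeps the labelled frame separate from the frame used for fitting.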