Tags: python, machine-learning, scikit-learn, k-means

K-means predict throws an error after one-hot encoding. Is the column count from before encoding the cause?


I am using K-means clustering on a dataset with some categorical features. I have some old code that operated on non-categorical data, and there the sequence of fit followed by predict works as expected.

Now I am modifying that working code to handle a dataset that has some categorical features, which requires one hot encoding. This is where everything goes a bit pear-shaped.

It seems the predict method call is expecting the old number of columns from before one hot encoding was carried out. After dropping the target column, the dataset has 17 columns. After one hot encoding, it has 29 columns.

Here is my code:

import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from google.colab import drive
drive.mount('/gdrive')
#Change current working directory to gdrive
%cd /gdrive

#Read files
inputFileA = r'/gdrive/My Drive/FilenameA.csv'
trainDataA = pd.read_csv(inputFileA) #creates a dataframe
print(trainDataA.shape)

#Extract training and test data
print("------------------\nShapes before dropping target column")
print(trainDataA.shape)
y_trainA = trainDataA["Revenue"]
X_trainA = trainDataA.drop(["Revenue"], axis=1) #extracting training data without target column
print("------------------\nShapes after dropping target column")
print(X_trainA.shape)

#categorical features of dataset A
categoricalFeaturesA = ["Month", "VisitorType","Weekend"]
data_processed_A = pd.get_dummies(X_trainA,prefix_sep="__",columns=categoricalFeaturesA)
print("---------------\nDataset A\n",data_processed_A.head())
data_processed_A.to_csv(r'/gdrive/My Drive/data_processed_A.csv')

#K-Means Clustering ========================================================================
#Default Mode - K=8
kmeans = KMeans()
data_processed_A_fit = data_processed_A
print("===================")
print("Shape of processed data: \n", data_processed_A_fit.shape)
data_processed_A_fit.to_csv(r'/gdrive/My Drive/data_processed_A_after_fit.csv')

kmeans.fit(data_processed_A_fit)
print("Online shoppers dataset")
print("\n============\nDataset A labels")
print(kmeans.labels_)
print("==============\n\nDataset A Clusters")
print(kmeans.cluster_centers_)
#Print Silhouette measure
print("\nDataset A silhouette_score:",silhouette_score(data_processed_A, kmeans.labels_))


df_kmeansA = data_processed_A
print(df_kmeansA.head())
print(df_kmeansA.dtypes)
kmeans_predict_trainA = kmeans.predict(df_kmeansA)

It throws an error at the last line:

ValueError: Incorrect number of features. Got 30 features, expected 29

So it seems to be expecting the dataset prior to one hot encoding, but I can't figure out why.

EDIT: As requested, here are the outputs.

(12330, 18)

------------------
Shapes before dropping target column
(12330, 18)
------------------
Shapes after dropping target column
(12330, 17)

---------------
Dataset A
    Administrative  Administrative_Duration  ...  Weekend__False  Weekend__True
0               0                      0.0  ...               1              0
1               0                      0.0  ...               1              0
2               0                      0.0  ...               1              0
3               0                      0.0  ...               1              0
4               0                      0.0  ...               0              1

[5 rows x 29 columns]

===================
Shape of processed data: 
 (12330, 29)

Online shoppers dataset

============
Dataset A labels
[1 1 1 ... 1 1 1]
==============

Dataset A Clusters
[[ 3.81805930e+00  1.38862225e+02  9.64959569e-01  6.74071040e+01
   5.82958221e+01  2.41869720e+03  7.61833487e-03  2.22516393e-02
   8.26725184e+00  5.21563342e-02  2.12398922e+00  2.27021563e+00
   3.19204852e+00  3.92318059e+00  3.70619946e-02  1.35444744e-01
   4.71698113e-03  3.77358491e-02  1.95417790e-02  1.04447439e-01
   2.58086253e-01  3.29514825e-01  4.38005391e-02  2.96495957e-02
   6.13207547e-02  1.34770889e-03  9.37331536e-01  7.85040431e-01
   2.14959569e-01]
 [ 1.30855956e+00  4.07496939e+01  2.02343866e-01  1.04729641e+01
   1.08400831e+01  2.55187795e+02  3.36811071e-02  5.91174011e-02
   3.49331537e+00  6.83281412e-02  2.12030856e+00  2.36315087e+00
   3.17015280e+00  4.23497997e+00  3.32294912e-02  1.38851802e-01
   2.24002374e-02  3.85699451e-02  2.50704643e-02  1.79943629e-01
   2.89126242e-01  1.90921228e-01  4.56905504e-02  3.61964100e-02
   1.73713099e-01  1.00875241e-02  8.16199377e-01  7.76442664e-01
   2.23557336e-01]
 [ 6.91666667e+00  2.25307183e+02  2.61111111e+00  1.95093981e+02
   2.78583333e+02  1.23142325e+04  5.09377058e-03  1.83440117e-02
   4.99428623e+00  2.50000000e-02  2.06944444e+00  2.37500000e+00
   2.48611111e+00  3.40277778e+00  4.16666667e-02  5.55555556e-02
  -1.73472348e-17  2.77777778e-02  5.55555556e-02  2.77777778e-02
   1.11111111e-01  6.38888889e-01  2.77777778e-02  1.38888889e-02
   1.38888889e-02  1.30104261e-17  9.86111111e-01  7.36111111e-01
   2.63888889e-01]
 [ 1.10000000e+01  3.01400198e+03  1.50000000e+01  2.29990417e+03
   5.77000000e+02  5.35723778e+04  2.80784550e-03  2.15663890e-02
   3.81914478e-01  0.00000000e+00  2.00000000e+00  2.00000000e+00
   1.00000000e+00  8.00000000e+00  0.00000000e+00  5.00000000e-01
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   5.00000000e-01  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  5.00000000e-01
   5.00000000e-01]
 [ 4.86342229e+00  1.77626788e+02  1.34065934e+00  1.05527518e+02
   9.92605965e+01  4.34306889e+03  6.80965548e-03  2.14584490e-02
   8.25797938e+00  5.40031397e-02  2.15855573e+00  2.39089482e+00
   3.01726845e+00  3.65934066e+00  3.61067504e-02  1.30298273e-01
   3.13971743e-03  2.98273155e-02  1.25588697e-02  6.90737834e-02
   1.97802198e-01  4.41130298e-01  4.55259027e-02  3.45368917e-02
   2.66875981e-02  3.13971743e-03  9.70172684e-01  7.66091052e-01
   2.33908948e-01]
 [ 6.85123967e+00  2.23415936e+02  2.29338843e+00  1.93528478e+02
   1.64049587e+02  7.41594639e+03  6.53738660e-03  2.02325121e-02
   5.16682694e+00  3.63636364e-02  2.16115702e+00  2.28925620e+00
   2.80991736e+00  3.44628099e+00  2.89256198e-02  9.09090909e-02
   4.13223140e-03  4.54545455e-02  4.54545455e-02  5.37190083e-02
   1.15702479e-01  5.28925620e-01  3.71900826e-02  4.95867769e-02
   4.13223140e-03  4.13223140e-03  9.91735537e-01  7.68595041e-01
   2.31404959e-01]
 [ 2.74824952e+00  1.00268631e+02  5.53150859e-01  3.65903439e+01
   3.26989179e+01  1.15670886e+03  9.20109170e-03  2.52780038e-02
   9.51099189e+00  5.55060471e-02  2.12412476e+00  2.38415022e+00
   3.15085933e+00  3.92520687e+00  3.81922342e-02  1.52450668e-01
   7.32017823e-03  2.60980267e-02  2.13239975e-02  1.52768937e-01
   2.76575430e-01  2.42838956e-01  4.32845321e-02  3.91470401e-02
   1.31444940e-01  3.81922342e-03  8.64735837e-01  7.40292807e-01
   2.59707193e-01]
 [ 1.48000000e+01  1.12191581e+03  4.80000000e+00  6.74591667e+02
   4.78400000e+02  2.32310689e+04  6.77737780e-03  2.03073056e-02
   4.29149073e+00 -6.93889390e-18  1.90000000e+00  2.10000000e+00
   1.70000000e+00  4.90000000e+00  1.00000000e-01  1.00000000e-01
   3.46944695e-18  2.00000000e-01  3.46944695e-18 -2.77555756e-17
   0.00000000e+00  4.00000000e-01 -6.93889390e-18  2.00000000e-01
   2.77555756e-17  8.67361738e-19  1.00000000e+00  9.00000000e-01
   1.00000000e-01]]

Dataset A silhouette_score: 0.564190293354119

   Administrative  Administrative_Duration  ...  Weekend__True  Cluster Number
0               0                      0.0  ...              0               1
1               0                      0.0  ...              0               1
2               0                      0.0  ...              0               1
3               0                      0.0  ...              0               1
4               0                      0.0  ...              1               1

[5 rows x 30 columns]
Administrative                      int64
Administrative_Duration           float64
Informational                       int64
Informational_Duration            float64
ProductRelated                      int64
ProductRelated_Duration           float64
BounceRates                       float64
ExitRates                         float64
PageValues                        float64
SpecialDay                        float64
OperatingSystems                    int64
Browser                             int64
Region                              int64
TrafficType                         int64
Month__Aug                          uint8
Month__Dec                          uint8
Month__Feb                          uint8
Month__Jul                          uint8
Month__June                         uint8
Month__Mar                          uint8
Month__May                          uint8
Month__Nov                          uint8
Month__Oct                          uint8
Month__Sep                          uint8
VisitorType__New_Visitor            uint8
VisitorType__Other                  uint8
VisitorType__Returning_Visitor      uint8
Weekend__False                      uint8
Weekend__True                       uint8
Cluster Number                      int32
dtype: object
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-cf4258f963fa> in <module>()
      3 print(df_kmeansA.head())
      4 print(df_kmeansA.dtypes)
----> 5 kmeans_predict_trainA = kmeans.predict(df_kmeansA)
      6 df_kmeansA['Cluster Number'] = kmeans_predict_trainA
      7 

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/_kmeans.py in _check_test_data(self, X)
    815             raise ValueError("Incorrect number of features. "
    816                              "Got %d features, expected %d" % (
--> 817                                  n_features, expected_n_features))
    818 
    819         return X

ValueError: Incorrect number of features. Got 30 features, expected 29

Solution

  • it seems to be expecting the dataset prior to one hot encoding

    It is not; if it were, it would ask for 17 features, not the 29 it actually expects:

    ValueError: Incorrect number of features. Got 30 features, expected 29
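
As a quick sanity check on where 29 comes from, the counts can be read off the dtypes listing in the question: 14 columns left untouched by the encoding, plus the dummy columns for Month, VisitorType and Weekend. A minimal sketch of that arithmetic (the variable names are only illustrative):

```python
# Column counts read off the dtypes listing in the question.
untouched_columns = 14  # Administrative ... TrafficType
dummy_columns = {"Month": 10, "VisitorType": 3, "Weekend": 2}

# Before encoding, each categorical feature is a single column.
before_encoding = untouched_columns + len(dummy_columns)

# After encoding, each categorical feature becomes one column per category.
after_encoding = untouched_columns + sum(dummy_columns.values())

print(before_encoding, after_encoding)  # 17 29
```
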
    

    So it complains about one feature more than expected. Looking closely at your printouts, the result of

    print(df_kmeansA.head())
    

    shows [5 rows x 30 columns], including a Cluster Number column. Your KMeans, however, was fitted on data_processed_A_fit, which has

    ===================
    Shape of processed data: 
     (12330, 29)
    

    and no Cluster Number column.
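
One way to confirm this kind of mismatch is to compare the dataframe's columns with what the model was fitted on. The sketch below uses a small made-up dataframe as a stand-in for data_processed_A; the column names a, b, c and the random values are assumptions for illustration only, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for data_processed_A (hypothetical columns and values).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((20, 3)), columns=["a", "b", "c"])

# Remember which columns the model sees at fit time.
fitted_columns = list(features.columns)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Simulate the stray column being added after fitting.
features["Cluster Number"] = kmeans.labels_

# The fitted model expects as many features as its centroids have coordinates.
expected = kmeans.cluster_centers_.shape[1]
extra = [c for c in features.columns if c not in fitted_columns]
print(features.shape[1], expected, extra)
```

Any column listed in extra was added after fitting and has to be dropped before calling predict.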

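    So although you set data_processed_A_fit = data_processed_A and df_kmeansA = data_processed_A, some code not shown in the question adds the Cluster Number column to the data_processed_A dataframe, hence the error. Note that plain assignment does not copy a dataframe: all three names refer to the same object, so a column added through any one of them shows up in all of them. The traceback in fact shows such a line right after the predict call, df_kmeansA['Cluster Number'] = kmeans_predict_trainA, so the column was most likely added by an earlier run of that cell, after the model had already been fitted on 29 columns.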
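
A possible way to avoid the problem is to keep the feature dataframe untouched and store the cluster labels in an explicit copy. This is a minimal sketch under assumptions, not the asker's actual code: the toy dataframe, the names feature_columns and clustered, and the choice of two clusters are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for data_processed_A (hypothetical columns and values).
rng = np.random.default_rng(0)
data_processed = pd.DataFrame(rng.random((20, 3)), columns=["a", "b", "c"])

# Plain assignment only creates a second name for the same object.
alias = data_processed
assert alias is data_processed

# Fit on an explicit list of feature columns and reuse it for predict.
feature_columns = list(data_processed.columns)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data_processed[feature_columns])

# Store the labels in a real copy, so the feature frame keeps its 3 columns.
clustered = data_processed.copy()
clustered["Cluster Number"] = kmeans.predict(data_processed[feature_columns])

# Predicting again still works, however often this part is re-run.
labels_again = kmeans.predict(data_processed[feature_columns])
print(data_processed.shape, clustered.shape)  # (20, 3) (20, 4)
```

Selecting data_processed[feature_columns] before every predict call also protects against other columns being added later, at the cost of carrying the column list around.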