python machine-learning scikit-learn one-hot-encoding

Isn't the purpose of scikit-learn's fit_transform, ColumnTransformer and OneHotEncoder to encode categorical data? Why are they used on numerical values?


I was looking for machine learning examples to study and came across this one: https://www.kaggle.com/saulalquicira/model-evaluation-using-cross-val-score-and-kfold

I understand everything in the code except for this part:

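from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# X is assumed to be the feature array defined earlier in the linked notebook.
# Each step one-hot encodes one integer-coded column. The encoded columns are
# placed first in the output, so the indices of the remaining columns shift.
labelencoder_X = LabelEncoder()
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
ct = ColumnTransformer([("cp", OneHotEncoder(), [2])], remainder='passthrough')
X = ct.fit_transform(X)

ct = ColumnTransformer([("restecg", OneHotEncoder(), [9])], remainder='passthrough')
X = ct.fit_transform(X)

ct = ColumnTransformer([("slope", OneHotEncoder(), [15])], remainder='passthrough')
X = ct.fit_transform(X)

ct = ColumnTransformer([("ca", OneHotEncoder(), [18])], remainder='passthrough')
X = ct.fit_transform(X)

ct = ColumnTransformer([("thal", OneHotEncoder(), [22])], remainder='passthrough')
X = ct.fit_transform(X)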

I understand what every individual keyword does, but why are we using this on values that are already numerical? I thought this was done on categorical data stored as text, in order to transform it into numerical binary values that machine learning algorithms can understand. Here is how the dataset looks:

[Image: DataSet]


Solution

  • The features being transformed here are numerical only in representation. They have already been integer (label) encoded, but the data they represent may still be categorical in nature.

    When you are working with ordinal data (categorical, but with a meaningful order between the values, i.e. 1 < 2 < 3), label encoding is sufficient. If you are working with nominal values, which have no meaningful order, it is still useful to one-hot encode them, or to use some other technique, so that your algorithm does not infer a false ordering from the integer codes.
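To make this concrete, here is a minimal sketch of one-hot encoding a column that is already stored as integers. The toy array and the transformer name "cat" are made up for illustration and are not taken from the Kaggle dataset. Column 1 holds integer codes that are assumed to stand for unordered categories.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (invented for illustration): column 0 is a continuous value,
# column 1 is an integer-coded category assumed to have no meaningful order.
X = np.array([
    [63.0, 0],
    [41.0, 2],
    [57.0, 1],
    [45.0, 2],
])

ct = ColumnTransformer(
    [("cat", OneHotEncoder(), [1])],  # one-hot encode column 1 only
    remainder="passthrough",          # keep column 0 unchanged
    sparse_threshold=0,               # always return a dense array
)
X_encoded = ct.fit_transform(X)

# One indicator column per category (0, 1, 2), followed by the untouched column.
print(X_encoded)
```

Each row ends up with a single 1 among the three indicator columns, so a model no longer sees category 2 as "twice" category 1. The encoded columns come first and the passthrough column last, which is consistent with the column indices changing from step to step in the snippet from the question.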