python scikit-learn regression sklearn-pandas

sklearn tree treats categorical variable as float during splits, how should I solve this?

I'm having trouble with a regression tree that I built with the sklearn package on a book dataset. The fitted tree can be seen below:

The problem is in the STORY_LANGUAGE variable. This is a categorical variable with the values 0, 1, 2, and 3, each of which corresponds to a different language of the book. Before running the model, I made sure that STORY_LANGUAGE is a categorical variable, yet the tree still splits on it as if it were a float (at 1.5).

How should I solve this? Any help is appreciated!

Solution

By passing the categories as integers to scikit-learn, you're telling it that the values are ordered, e.g. that 0 is more closely related to 1 than to 2. scikit-learn's trees treat every input column as numeric regardless of its pandas dtype, which is why you see a split at a threshold like 1.5. To get around this, you will need to do one-hot encoding with the built-in OneHotEncoder. If you have three categories, 0, 1 and 2, a 0 will be converted to [1,0,0], a 1 to [0,1,0], and a 2 to [0,0,1]. In other words, your single feature is replaced with a vector that is 1 at the position corresponding to its category and 0 everywhere else.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Generate random integers between 0 and 2
x = np.random.randint(0,3, size=(100,1))
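# Create the one-hot encoder object, asking for a dense array instead of a sparse matrix.
# The argument is named sparse_output in scikit-learn 1.2 and later; older releases call it sparse.
m = OneHotEncoder(sparse_output=False)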
# Transform your features
x_one_hot = m.fit_transform(x)
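
To apply the same idea to a DataFrame like the one in the question, one option is to wrap the encoder in a ColumnTransformer so that only STORY_LANGUAGE is encoded and the other columns pass through unchanged. This is only a sketch: the PAGE_COUNT column, the target y and the tree settings are made up for illustration, since the actual dataset isn't shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in for the book dataset: only STORY_LANGUAGE comes
# from the question, PAGE_COUNT and the target y are invented.
df = pd.DataFrame({
    "STORY_LANGUAGE": np.arange(200) % 4,           # categories 0, 1, 2, 3
    "PAGE_COUNT": rng.integers(50, 900, size=200),  # assumed numeric feature
})
y = rng.normal(size=200)

# One-hot encode STORY_LANGUAGE only; every other column is passed through.
preprocess = ColumnTransformer(
    [("language", OneHotEncoder(handle_unknown="ignore"), ["STORY_LANGUAGE"])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
])
model.fit(df, y)

# Four indicator columns (one per language) followed by PAGE_COUNT.
X_encoded = model.named_steps["preprocess"].transform(df)
print(X_encoded.shape)
```

Note that a plotted tree will still show a threshold on the encoded columns, typically 0.5. On a 0/1 indicator column that split simply means "is the book in this language or not", so it no longer implies any ordering between the languages.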