I'm having trouble with a regression tree built with the sklearn package. It's fitted on a book dataset, and the resulting tree can be seen below:
The problem is in the STORY_LANGUAGE variable. This is a categorical variable with the values 0, 1, 2, and 3, each of which corresponds to a different language of the book. Before running the model, I made sure that STORY_LANGUAGE is a categorical variable, yet the tree still splits on it and treats it as a float (1.5).
How should I solve this? Any help is appreciated!
By passing a list of integers as a feature to scikit-learn, you're telling it that there's some sort of ordering between the values, e.g. that 0 is more closely related to 1 than to 2. scikit-learn's decision trees treat every feature as numeric, which is why the split lands on a threshold like 1.5. To get around this, you will need to do one-hot encoding with the built-in OneHotEncoder. If you have three categories, 0, 1 and 2, a 0 will be converted to [1,0,0], while a 1 will be converted to [0,1,0]. Basically, your one feature is replaced with a vector that is equal to 1 at the position corresponding to its class and 0 elsewhere.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Generate random integers between 0 and 2
x = np.random.randint(0,3, size=(100,1))
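# Create the one-hot encoder object, asking for a dense array instead of a sparse matrix.
# On scikit-learn older than 1.2 this parameter is named sparse: OneHotEncoder(sparse=False)
m = OneHotEncoder(sparse_output=False)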
# Transform your features
x_one_hot = m.fit_transform(x)
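To apply this to only the language column while leaving other features alone, something like the following might work. It is a minimal sketch, not the asker's actual setup: the data is made up, page_count and language_effect are hypothetical names, and column 0 is assumed to hold the STORY_LANGUAGE codes.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: column 0 holds the language code (0-3),
# column 1 holds a numeric feature (page_count is an invented name).
story_language = np.tile([0, 1, 2, 3], 50)
page_count = rng.integers(50, 500, size=200)
X = np.column_stack([story_language, page_count])

# Invented target in which the language effect has no natural order.
language_effect = np.array([1.0, 5.0, 2.0, 4.0])
y = language_effect[story_language] + 0.01 * page_count + rng.normal(scale=0.1, size=200)

# One-hot encode only column 0 and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("language", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
])
model.fit(X, y)

# Four indicator columns for the language plus the one numeric column.
X_encoded = model.named_steps["preprocess"].transform(X)
print(X_encoded.shape)
```

With the indicator columns in place, any split the tree makes on a language column falls at 0.5, which amounts to asking "is this book in language k or not?" rather than comparing language codes as numbers.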