python, scikit-learn, regression, sklearn-pandas

sklearn tree treats categorical variable as float during splits, how should I solve this?


I'm having trouble with a regression tree built with the sklearn package. It is fitted on a book dataset, and the resulting tree is shown below:

[Image: sklearn tree]

The problem is in the STORY_LANGUAGE variable. This is a categorical variable with the values 0, 1, 2, and 3, each corresponding to a different language of the book. Before running the model, I made sure that STORY_LANGUAGE is a categorical variable, yet the tree still splits on it at a float threshold (1.5).

How should I solve this? Any help is appreciated!


Solution

  • By passing the categories to scikit-learn as integers, you're telling it that the values are ordered numbers, e.g. that 0 is more closely related to 1 than to 2. scikit-learn's trees convert their input to a float array, so a categorical dtype set beforehand is lost, and a threshold split such as STORY_LANGUAGE <= 1.5 simply groups languages 0 and 1 against 2 and 3. To get around this, one-hot encode the variable with the built-in OneHotEncoder. If you have three categories, 0, 1 and 2, a 0 will be converted to [1,0,0], while a 1 will be converted to [0,1,0]. In other words, your one feature is replaced with a vector that is 1 at the position corresponding to its class and 0 everywhere else.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    
    # Generate random integers between 0 and 2
    x = np.random.randint(0,3, size=(100,1))
    # Create the one-hot encoder, requesting a dense array instead of a sparse matrix.
    # (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False.)
    m = OneHotEncoder(sparse_output=False)
    # Transform your features
    x_one_hot = m.fit_transform(x)