Tags: python, machine-learning, scikit-learn, decision-tree

Why does a column of 1s impact the results of a decision tree classifier?


I was testing sklearn's Pipeline on a randomly generated classification problem:

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

x, y = make_classification(n_samples=100, n_features=5, random_state=10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('poly', PolynomialFeatures(degree=2, include_bias=False)),
                       ('model', model)])
pipe.fit(x_train, y_train)

pipe_pred = pipe.predict(x_test)
accuracy_score(y_test, pipe_pred)

This results in an accuracy score of 0.85. However, when I change the PolynomialFeatures argument include_bias to True, which just inserts a single column of 1s into the array, the accuracy score becomes 0.90. For comparison, I have plotted the fitted tree for each setting below:

[Tree plot when include_bias=True]

[Tree plot when include_bias=False]

These images were generated by plot_tree(pipe['model']).

The datasets are identical except that, when include_bias=True, an additional column of 1s is inserted at column 0. So column index i + 1 in the include_bias=True data corresponds to column index i in the include_bias=False data (e.g. with_bias[:, 5] == without_bias[:, 4]).
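The index shift described above can be checked directly (a minimal sketch using the same make_classification call as the question; the variable names with_bias and without_bias are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=100, n_features=5, random_state=10)

with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# Column 0 of the include_bias=True output is the constant bias term,
# and every remaining column matches the include_bias=False output.
assert np.all(with_bias[:, 0] == 1)
assert np.array_equal(with_bias[:, 1:], without_bias)
```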

Based on my understanding, the column of 1s shouldn't have an impact on the Decision Tree. What am I missing?


Solution

  • From the documentation for DecisionTreeClassifier:

    random_state : int, RandomState instance, default=None
    Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.

    You've fixed random_state, so each individual run is reproducible, but adding a column changes the random feature permutation that the tree performs at each split. When several candidate splits yield an identical improvement in the criterion, one of them is chosen at random, so a different permutation can pick a different (equally good) split. Note that the gini value is the same at each node of both your trees, even though different features are making the splits, which is exactly what tied splits look like.
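    One thing worth confirming is that the bias column itself is never the cause: a constant column has no threshold that can improve the criterion, so the tree never splits on it. A minimal sketch (reusing the question's data and pipeline order of scaling before polynomial expansion; tree_.feature is sklearn's per-node split-feature array, where -2 marks leaves):

    ```python
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    x, y = make_classification(n_samples=100, n_features=5, random_state=10)
    x_train, _, y_train, _ = train_test_split(x, y, test_size=.2, random_state=0)

    xs = StandardScaler().fit_transform(x_train)
    xb = PolynomialFeatures(degree=2, include_bias=True).fit_transform(xs)

    tree = DecisionTreeClassifier(random_state=0).fit(xb, y_train)
    used = set(tree.tree_.feature) - {-2}  # drop the leaf marker
    # The constant bias column (index 0) cannot reduce impurity,
    # so it never appears among the split features.
    assert 0 not in used
    ```

    The bias column only matters indirectly: it changes n_features, which changes the random permutation used to break ties between equally good splits.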