I was testing sklearn's Pipeline
on a randomly generated classification problem:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
x, y = make_classification(n_samples=100, n_features=5, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)
model = DecisionTreeClassifier(random_state=0)
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('poly', PolynomialFeatures(degree=2, include_bias=False)),
                       ('model', model)])
pipe.fit(x_train, y_train)
pipe_pred = pipe.predict(x_test)
accuracy_score(y_test, pipe_pred)
This results in an accuracy score of .85. However, when I change the PolynomialFeatures argument include_bias to True, which just inserts a single column of 1s into the array, the accuracy score becomes .90.
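For reference, the second run differs only in that flag; here is a minimal sketch of it, reusing the objects defined above (pipe_bias is just a name I'm using for the refit pipeline):

# Same pipeline, but with the bias column included (the only change)
pipe_bias = Pipeline(steps=[('scale', StandardScaler()),
                            ('poly', PolynomialFeatures(degree=2, include_bias=True)),
                            ('model', DecisionTreeClassifier(random_state=0))])
pipe_bias.fit(x_train, y_train)
accuracy_score(y_test, pipe_bias.predict(x_test))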
For visualization, below I have plotted the individual trees for the results when include_bias is True and when it is False:

include_bias=True: [tree plot]

include_bias=False: [tree plot]

These images were generated by plot_tree(pipe['model']).
The datasets are the same except that when include_bias=True an additional column of 1s is inserted at column 0, so column i in the include_bias=False data corresponds to column i + 1 in the include_bias=True data (e.g. with_bias[:, 5] == without_bias[:, 4]).
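This correspondence is easy to verify directly on the transformed arrays (a quick sketch, applying the same scaling step first; with_bias and without_bias are the names used above):

# Verify the column shift on the scaled training data
scaled = StandardScaler().fit_transform(x_train)
with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(scaled)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(scaled)
assert np.allclose(with_bias[:, 0], 1.0)            # column 0 is the bias column
assert np.allclose(with_bias[:, 1:], without_bias)  # remaining columns shifted by one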
Based on my understanding, the column of 1s shouldn't have an impact on the Decision Tree. What am I missing?
From the documentation for DecisionTreeClassifier:
random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
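You can see this tie-breaking directly with a toy example (a sketch; x_dup is a hypothetical array built from two identical copies of one feature, so a split on either copy improves the criterion by exactly the same amount):

# Two identical features tie exactly, so which one wins the root split
# depends on the seeded random permutation of the features
x_dup = np.hstack([x_train[:, :1], x_train[:, :1]])
for seed in (0, 1, 2):
    t = DecisionTreeClassifier(random_state=seed, max_depth=1).fit(x_dup, y_train)
    print(seed, '-> root split feature:', t.tree_.feature[0])

The printed index may flip between 0 and 1 as the seed changes, even though the data are identical.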
You've set the random_state, but having a different number of columns will nevertheless make those random feature permutations different. Note that the value of gini is the same for both of your trees at each node, even though different features are making the splits.
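You can check this node by node by inspecting the fitted trees (a sketch; pipe_bias stands for the pipeline refit with include_bias=True, and tree_.feature / tree_.impurity hold the per-node split feature and gini value):

# Compare the first few nodes of the two trees: the split features differ,
# but the gini reached at each node matches
for name, p in [('include_bias=False', pipe), ('include_bias=True', pipe_bias)]:
    t = p['model'].tree_
    print(name, 'features:', t.feature[:5], 'gini:', t.impurity[:5].round(3))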