I have been trying to build a RandomForestClassifier() (RF) model and a DecisionTreeClassifier() (DT) model that produce the same output (for learning purposes only). I found several answered questions listing the parameters required to make the two models equivalent and used them to build the code below, but I couldn't find any code that actually does it, so I'm trying to write it myself:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
random_seed = 42
X, y = make_classification(
n_samples=100000,
n_features=5,
random_state=random_seed
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)
DT = DecisionTreeClassifier(criterion='gini', # default
splitter='best', # default
max_depth=None, # default
min_samples_split=2, # default
min_samples_leaf=1, # default
min_weight_fraction_leaf=0.0, # default
max_features=None, # default
random_state=random_seed, # NON-default
max_leaf_nodes=None, # default
min_impurity_decrease=0.0, # default
class_weight=None, # default
ccp_alpha=0.0 # default
)
DT.fit(X_train, y_train)
RF = RandomForestClassifier(n_estimators=1, # NON-default
criterion='gini', # default
max_depth=None, # default
min_samples_split=2, # default
min_samples_leaf=1, # default
min_weight_fraction_leaf=0.0, # default
max_features=None, # NON-default
max_leaf_nodes=None, # default
min_impurity_decrease=0.0, # default
bootstrap=False, # NON-default
oob_score=False, # default
n_jobs=None, # default
random_state=random_seed, # NON-default
verbose=0, # default
warm_start=False, # default
class_weight=None, # default
ccp_alpha=0.0, # default
max_samples=None # default
)
RF.fit(X_train, y_train)
RF_pred = RF.predict(X_test)
RF_proba = RF.predict_proba(X_test)
DT_pred = DT.predict(X_test)
DT_proba = DT.predict_proba(X_test)
# Check whether the outputs are equal, and report the fraction of rows that differ
print('If DT_pred = RF_pred:', np.array_equal(DT_pred, RF_pred), '; Fraction of rows not equal:', (DT_pred != RF_pred).mean())
print('If DT_proba = RF_proba:', np.array_equal(DT_proba, RF_proba), '; Fraction of rows not equal:', (DT_proba != RF_proba).any(axis=1).mean())
# A plot that shows where those differences are concentrated
sns.set(style="darkgrid")
mask = (RF_proba[:,1] - DT_proba[:,1]) != 0
only_differences = (RF_proba[:,1] - DT_proba[:,1])[mask]
sns.kdeplot(only_differences, fill=True, color="r")  # `fill` replaces the deprecated `shade` (seaborn >= 0.11)
plt.title('Plot of only differences in probs scores')
plt.show()
Output:
I even found an answer comparing XGBoost with a decision tree and claiming they are almost identical, but when I test their probability outputs they are fairly different.
So, am I doing something wrong here? How can I get the same probabilities from these two models? Is it possible to get True for both print() statements in the code above?
It appears to be due to random states, despite your best efforts. For the random forest's randomization to be effective, it needs to give each component decision tree a different random state (using sklearn.ensemble._base._set_random_states, source). You can check in your code that while RF.random_state and DT.random_state are both 42, RF.estimators_[0].random_state is 1608637542.
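A minimal sketch of that check, on a smaller dataset than in the question so it runs quickly. The names rf, inner_seed and derived are made up for this illustration. The last step assumes the forest draws the tree's seed as the first randint call on a RandomState built from its own seed; that is an internal detail of scikit-learn and may differ between versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (size chosen only to keep the example fast)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

rf = RandomForestClassifier(
    n_estimators=1, bootstrap=False, max_features=None, random_state=42
)
rf.fit(X, y)

# The forest keeps the seed it was given, but its single tree gets a derived one
inner_seed = rf.estimators_[0].random_state
print("forest random_state:", rf.random_state)
print("inner tree random_state:", inner_seed)

# Assumption (private implementation detail): the derived seed is the first
# integer drawn from a RandomState seeded with the forest's own seed.
derived = np.random.RandomState(42).randint(np.iinfo(np.int32).max)
print("derived by hand:", derived)
```

If the two printed seeds differ, a standalone tree built with random_state=42 is not seeded the same way as the tree inside the forest.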
When bootstrap=False and max_features=None, I believe this only affects how splits with tied gain are resolved, so the results are very close on the training set. That can translate into slightly larger differences on a test set.
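If the differing random state is indeed the only cause, one way to test it is to seed the standalone tree with the seed the forest actually handed to its inner tree. The sketch below does that on a smaller dataset than in the question; the names rf, dt, same_pred and same_proba are chosen for this illustration, and all other parameters are left at their defaults apart from the ones the question already changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One tree, no bootstrapping, all features considered at every split
rf = RandomForestClassifier(
    n_estimators=1, bootstrap=False, max_features=None, random_state=42
).fit(X_train, y_train)

# Reuse the seed of the forest's inner tree instead of the forest's own seed
dt = DecisionTreeClassifier(
    random_state=rf.estimators_[0].random_state
).fit(X_train, y_train)

same_pred = np.array_equal(dt.predict(X_test), rf.predict(X_test))
same_proba = np.array_equal(dt.predict_proba(X_test), rf.predict_proba(X_test))
print("Predictions equal:", same_pred)
print("Probabilities equal:", same_proba)
```

The same idea applies to the code in the question: pass RF.estimators_[0].random_state to the DecisionTreeClassifier after fitting RF, rather than random_seed.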