In the latest scikit-learn release (1.3), it was announced that DecisionTreeClassifier now supports missing values. The implementation evaluates splits with missing values going either to the left or right nodes (see release highlights).
However, when I tried using RandomForestClassifier, which is an ensemble of DecisionTreeClassifiers, it appears that it doesn't support missing values in the same way. I assumed that since RandomForestClassifier is built on DecisionTreeClassifier, it would also support missing values.
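To make the split rule mentioned above concrete, here is a toy sketch of "try the missing values on the left, then on the right, and keep whichever is better". It is plain NumPy and is not scikit-learn's actual implementation: the function name `best_split_with_missing`, the Gini criterion, and the midpoint thresholds are all assumptions made for illustration.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 for an empty array)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return 1.0 - np.sum(proportions ** 2)


def best_split_with_missing(x, y):
    """Toy single-feature split search; hypothetical helper, not sklearn code.

    Returns (threshold, missing_goes_left, weighted_impurity), or None when
    fewer than two distinct non-missing values are available.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    missing = np.isnan(x)
    values = np.unique(x[~missing])
    # Candidate thresholds: midpoints between consecutive observed values.
    thresholds = (values[:-1] + values[1:]) / 2
    best = None
    for t in thresholds:
        # Evaluate the same threshold twice: NaNs sent left, then NaNs sent right.
        for missing_goes_left in (True, False):
            left = (~missing & (x <= t)) | (missing & missing_goes_left)
            right = ~left
            score = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / y.size
            if best is None or score < best[2]:
                best = (float(t), missing_goes_left, float(score))
    return best


# Same toy data as in the question: the NaN sample has label 1.
print(best_split_with_missing([0, 1, 6, np.nan], [0, 0, 1, 1]))
# (3.5, False, 0.0)
```

On this data the purest split sends the NaN sample to the right, together with the value 6.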
Here's a simple snippet I used for testing:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
forest = RandomForestClassifier(random_state=0).fit(X, y)
predictions = forest.predict(X)
This throws the following error related to the presence of missing values:
ValueError: Input X contains NaN. RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
The same code with DecisionTreeClassifier works just fine. Can anyone help explain why RandomForestClassifier doesn't support missing values, despite being an ensemble of DecisionTreeClassifiers?
As pointed out by Ben Reiniger in the comments, people are actually working on this feature. From scikit-learn's release history, I found out that RandomForestClassifier and RandomForestRegressor will support missing values from version 1.4 on: https://scikit-learn.org/dev/whats_new/v1.4.html#sklearn-ensemble