I am trying to train a random forest model with sklearn. I have some original data (x, y) that I use to train the RF initially.
from sklearn.ensemble import RandomForestClassifier
import numpy as np
x = np.random.rand(30,20)
y = np.round(np.random.rand(30))
rf = RandomForestClassifier()
rf.fit(x,y)
Now I get some new data that I want to use to retrain the model, but I want to keep the already existing trees in the rf untouched. So I set warm_start=True and add additional trees.
x_new = np.random.rand(5,20)
y_new = np.round(np.random.rand(5))
rf.n_estimators +=100
rf.warm_start = True
rf.fit(x_new,y_new)
So far so good. Everything works. But when I make predictions I get an error:
rf.predict(x)
>>> ValueError: non-broadcastable output operand with shape (30,1) doesn't match the broadcast shape (30,2)
Why does this happen?
This runs fine for me in Colab, with the same sklearn version (1.2.2) you mentioned. I suspect the issue is similar to what was indicated by the now-deleted answer, as well as in "Scikit-learn Randomforest with warm_start results (non-broadcastable output ...)": one of your datasets (in this case, y_new) doesn't contain the same classes as the other.
I tested this by setting y_new = np.ones(5) instead of random values, and I get the same error; so I think you were just unlucky and got all ones or all zeros with whatever random seed numpy had when you ran this the first time.
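If that is the cause, one way to guard against it is to check the new labels against the classes the forest already knows before adding trees. Below is a minimal sketch, not a definitive fix: the helper name same_classes is made up for illustration, and I fix the labels by hand (and use smaller n_estimators) so both classes are guaranteed to be present and the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Initial data; labels are built by hand so both classes are guaranteed present.
x = rng.random((30, 20))
y = np.tile([0.0, 1.0], 15)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(x, y)


def same_classes(model, labels):
    """Hypothetical helper: True if `labels` has exactly the classes the fitted model knows."""
    return bool(np.array_equal(np.unique(labels), model.classes_))


x_new = rng.random((5, 20))
y_bad = np.ones(5)                            # only one class, like an unlucky random draw
y_ok = np.array([0.0, 1.0, 1.0, 0.0, 1.0])    # both classes present

bad_check = same_classes(rf, y_bad)   # single-class batch fails the check
ok_check = same_classes(rf, y_ok)     # two-class batch passes the check

# Only add trees when the new batch covers the same classes as the original fit.
if ok_check:
    rf.n_estimators += 10
    rf.warm_start = True
    rf.fit(x_new, y_ok)

preds = rf.predict(x)
```

With a batch that fails the check, you could hold the samples back until you have collected at least one example of every class, and only then call fit with warm_start=True.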