Tags: python · machine-learning · scikit-learn · random-forest

"Warm Start" in combination with new data leads to broadcasting error when predicting with Random Forest


I am trying to train a random forest model with sklearn. I have some original data (x, y) that I use to train the RF initially.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Original training data: 30 samples, 20 features, binary labels
x = np.random.rand(30, 20)
y = np.round(np.random.rand(30))

rf = RandomForestClassifier()
rf.fit(x, y)

Now I get some new data that I want to use to retrain the model, but I want to keep the existing trees in the forest untouched. So I set warm_start=True and add additional trees.

# New batch: 5 samples, same 20 features
x_new = np.random.rand(5, 20)
y_new = np.round(np.random.rand(5))

rf.warm_start = True
rf.n_estimators += 100  # grow 100 additional trees on the new data
rf.fit(x_new, y_new)

So far so good. Everything works. But when I make predictions I get an error:

rf.predict(x)
>>> ValueError: non-broadcastable output operand with shape (30,1) doesn't match the broadcast shape (30,2)

Why does this happen?


Solution

  • This runs fine for me in colab, with the same sklearn version 1.2.2 mentioned. I suspect the issue is similar to what was indicated by the now-deleted answer, as well as Scikit-learn Randomforest with warm_start results (non-broadcastable output ...): one of your datasets (in this case, y_new) doesn't contain the same set of classes as the other.

    I tested this by setting y_new = np.ones(5) instead of random, and I get the same error; so I think you were simply unlucky, and np.round(np.random.rand(5)) produced all ones or all zeros with whatever random seed numpy had when you ran this the first time. The trees fitted on the original batch learned two classes, while the new trees learned only one, and predict_proba can't combine their outputs.
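
    The fix that follows from this explanation can be sketched as below: make sure every batch passed to fit under warm_start contains the same set of classes. This is a minimal sketch, assuming a hand-picked y_new with both labels present (the fixed seed and label values are illustrative, not from the original post).

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Original training data: 30 samples, 20 features, binary labels
    x = rng.random((30, 20))
    y = np.round(rng.random(30))

    rf = RandomForestClassifier(random_state=0)
    rf.fit(x, y)

    # New batch: force it to contain BOTH classes, so the added trees
    # see the same label set as the original ones.
    x_new = rng.random((5, 20))
    y_new = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

    # Keep the existing trees and grow 100 more on the new batch
    rf.set_params(warm_start=True, n_estimators=rf.n_estimators + 100)
    rf.fit(x_new, y_new)

    print(rf.predict(x).shape)  # (30,) — no broadcasting error
    ```

    With both classes present in y_new, every tree's probability output has the same width (two columns), so averaging across the forest at predict time broadcasts cleanly.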