Tags: python · machine-learning · scikit-learn · random-forest

"Warm Start" in combination with new data leads to broadcasting error when predicting with Random Forest


I am trying to train a random forest model with sklearn. I have some original data (x, y) that I use to train the RF initially.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Original training data: 30 samples, 20 features, binary labels
x = np.random.rand(30, 20)
y = np.round(np.random.rand(30))

rf = RandomForestClassifier()
rf.fit(x, y)

Now I get some new data that I want to use to retrain the model, but I want to keep the existing trees in the forest untouched. So I set warm_start=True and add additional trees.

# New batch: 5 samples, same 20 features
x_new = np.random.rand(5, 20)
y_new = np.round(np.random.rand(5))

rf.warm_start = True
rf.n_estimators += 100  # grow 100 additional trees on the new data
rf.fit(x_new, y_new)

So far so good. Everything works. But when I make predictions I get an error:

rf.predict(x)
>>> ValueError: non-broadcastable output operand with shape (30,1) doesn't match the broadcast shape (30,2)

Why does this happen?


Solution

  • This runs fine for me in colab, with the same sklearn version 1.2.2 mentioned. I suspect the issue is similar to what was indicated by the now-deleted answer, as well as Scikit-learn Randomforest with warm_start results (non-broadcastable output ...): one of your datasets (in this case, y_new) doesn't contain the same set of classes as the other.

    I tested this by setting y_new = np.ones(5) instead of random, and I get the same error; so I think you were simply unlucky, and np.round(np.random.rand(5)) produced all ones or all zeros with whatever random seed numpy had when you ran this the first time. The trees fitted on the original batch learned two classes, while the new trees learned only one, and predict_proba can't combine their outputs.
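
    The fix that follows from this explanation can be sketched as below: make sure every batch passed to fit under warm_start contains the same set of classes. This is a minimal sketch, assuming a hand-picked y_new with both labels present (the fixed seed and label values are illustrative, not from the original post).

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Original training data: 30 samples, 20 features, binary labels
    x = rng.random((30, 20))
    y = np.round(rng.random(30))

    rf = RandomForestClassifier(random_state=0)
    rf.fit(x, y)

    # New batch: force it to contain BOTH classes, so the added trees
    # see the same label set as the original ones.
    x_new = rng.random((5, 20))
    y_new = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

    # Keep the existing trees and grow 100 more on the new batch
    rf.set_params(warm_start=True, n_estimators=rf.n_estimators + 100)
    rf.fit(x_new, y_new)

    print(rf.predict(x).shape)  # (30,) — no broadcasting error
    ```

    With both classes present in y_new, every tree's probability output has the same width (two columns), so averaging across the forest at predict time broadcasts cleanly.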