Tags: python, machine-learning, scikit-learn, xgboost, xgbregressor

Why does my machine learning model perform poorly with batch training?


My machine learning model (an XGBoost regressor) seems to perform worse when trained in batches (i.e. epochs > 1). If I set the number of epochs to 1 (i.e. no batching), my out-of-sample model score is near 93%, which is great. However, when I set the number of epochs to 25 or 100, the out-of-sample score gets worse and worse as the epochs progress; by the last batch, the model can hardly predict anything. Does anyone see an issue with my code below? Thanks in advance!

Edit: genSold is a generator over my entire database.

import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

epochs = 100
batchSize = nSold // epochs  # samples per batch
print(batchSize)
print(batchSize * epochs)

model = xgboost.XGBRegressor()
for epoch in range(epochs):
    print(f"Epoch {epoch+1} of {epochs}")
    # Pull the next batchSize items from the generator
    data = []
    count = 0
    for item in genSold:
        if count == batchSize:
            break
        data.append(item)
        count += 1
    print(len(data))
    df = shuffle(pd.DataFrame(data))
    df2 = processData(df, numerical_features, categorical_features)
    df2.drop(columns=['house-id_listing'], inplace=True)
    df2 = df2.dropna(subset=prediction)
    Y = df2[prediction]
    X = df2.drop(columns=prediction)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
    if epoch == 0:
        model.fit(X_train, Y_train)
    else:
        # Continue boosting from the existing model on the new batch
        features = model.get_booster().feature_names
        print(len(features))
        model.fit(X_train[features], Y_train, xgb_model=model.get_booster())
    print(model.score(X_test, Y_test))
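(For context, a toy illustration of how an exhaustible generator interacts with a batch loop like the one above — small numbers, not the real genSold. Note that a loop written this way also silently drops one item per batch, since the item that triggers the break has already been pulled from the generator:)

```python
# Toy stand-in for genSold: a generator yields each item exactly once,
# so each pass of the batch loop pulls *new* items until exhaustion.
def gen():
    yield from range(10)

g = gen()
batches = []
for _ in range(3):
    batch = []
    for item in g:
        if len(batch) == 4:
            break  # the item just pulled (e.g. 4, then 9) is discarded
        batch.append(item)
    batches.append(batch)

print(batches)  # [[0, 1, 2, 3], [5, 6, 7, 8], []]
```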

Solution

  • model.fit using the sklearn API will not update existing trees; it only fits new trees to the new dataset. You can train an existing model incrementally using the Python API (by passing xgb_model), but as of 2018, batch training was not recommended by the devs. If you must do so, you need to pass over the entire training set several times to approach the performance of training in a single batch.

    Edit: This assumption was wrong.

    If I understand your code correctly, you are training on the first batchSize samples in genSold repeatedly, as samples are never removed from genSold.

    However, if that were the case, I would expect the score reported in the last line to improve: you shuffle each batch before splitting into train and test folds, which, after the first batch, should mean you are testing on samples you have previously trained on. Do you mean it is performing poorly on a separate hold-out set?
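    For illustration, here is a minimal sketch of what passing xgb_model actually does: the second fit appends new trees rather than updating the old ones. The data is synthetic (make_regression), not the asker's setup, and it assumes a reasonably recent xgboost (Booster.num_boosted_rounds was added in 1.4):

    ```python
    import xgboost
    from sklearn.datasets import make_regression

    # Synthetic regression data purely for the demo
    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    model = xgboost.XGBRegressor(n_estimators=50)
    model.fit(X[:500], y[:500])
    trees_after_first_fit = model.get_booster().num_boosted_rounds()

    # Passing xgb_model keeps the existing trees and boosts further on
    # the second batch; the earlier trees themselves are not revisited.
    model.fit(X[500:], y[500:], xgb_model=model.get_booster())
    trees_after_second_fit = model.get_booster().num_boosted_rounds()

    print(trees_after_first_fit, trees_after_second_fit)
    ```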