Tags: python, machine-learning, xgboost

How can I implement incremental training for xgboost?


The problem is that my training data cannot fit into RAM due to its size. So I need a method that first builds one tree on the whole training set, computes the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because in that case the whole model is simply rebuilt for each batch.


Solution

  • Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model via its xgb_model parameter.

    Here's a small experiment that I ran to convince myself that it works:

    First, split the Boston dataset into training and testing sets, then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will receive the additional parameter xgb_model. If passing in the extra parameter made no difference, we would expect their scores to be similar. But, fortunately, the model trained with xgb_model performs much better than the one trained from scratch on the second half.

    import xgboost as xgb
    from sklearn.model_selection import train_test_split as ttsplit  # sklearn.cross_validation has been removed from scikit-learn
    from sklearn.datasets import load_boston  # note: load_boston is only available in scikit-learn < 1.2
    from sklearn.metrics import mean_squared_error as mse
    
    # load the Boston housing data once
    boston = load_boston()
    X = boston['data']
    y = boston['target']
    
    # split data into training and testing sets
    # then split training set in half
    X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
    X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                         y_train, 
                                                         test_size=0.5,
                                                         random_state=0)
    
    xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
    xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
    xg_test = xgb.DMatrix(X_test, label=y_test)
    
    # 'reg:linear' was renamed 'reg:squarederror' in newer xgboost;
    # 'verbose' is not a recognized training parameter and is ignored
    params = {'objective': 'reg:linear', 'verbose': False}
    model_1 = xgb.train(params, xg_train_1, 30)  # 30 boosting rounds on the first half
    model_1.save_model('model_1.model')  # persist the booster so training can resume from it
    
    # ================= train two versions of the model =====================#
    model_2_v1 = xgb.train(params, xg_train_2, 30)
    model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')  # continue from model_1's trees
    
    print(mse(model_1.predict(xg_test), y_test))     # benchmark
    print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
    print(mse(model_2_v2.predict(xg_test), y_test))  # "after"
    
    # 23.0475232194
    # 39.6776876084
    # 27.2053239482
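
    To apply this to the original problem (training data that will not fit in RAM), the same trick can be chained across arbitrarily many batches. The following is only a sketch: iter_batches is a made-up stand-in for whatever reads your data chunk by chunk, and it relies on the fact that xgb_model accepts either a saved-model filepath or a live Booster object, so nothing has to be written to disk between batches.

    import numpy as np
    import xgboost as xgb

    def iter_batches(n_batches=5, rows=200, cols=10, seed=0):
        # stand-in for reading chunks of a disk-backed dataset;
        # yields synthetic (X, y) pairs purely for illustration
        rng = np.random.RandomState(seed)
        w = rng.rand(cols)  # fixed "true" weights shared by all batches
        for _ in range(n_batches):
            X = rng.rand(rows, cols)
            y = X @ w + 0.1 * rng.rand(rows)
            yield X, y

    params = {'objective': 'reg:linear'}
    booster = None  # nothing to continue from before the first batch

    for X_batch, y_batch in iter_batches():
        dtrain = xgb.DMatrix(X_batch, label=y_batch)
        # xgb_model=None on the first pass trains from scratch;
        # afterwards the existing Booster is extended with new trees
        booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)

    booster.save_model('incremental.model')

    One caveat: each call grows the model by num_boost_round trees whose gradients are computed on the current batch only, so the result is not identical to boosting over the full dataset at once.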
    

    reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py