python, machine-learning, xgboost, gradient-descent

How to get or see xgboost's gradient statistics value?


I'm studying xgboost and am new to gradient boosting. In gradient tree boosting, the loss function is approximated by a second-order Taylor expansion, which yields the per-sample statistics gi and hi. You can see this at https://xgboost.readthedocs.io/en/latest/model.html#the-structure-score . Given a dataset, how can I see the values gi, hi, i.e. g1, h1, g2, h2, ...?
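
For context, the page linked above defines these statistics as the first and second derivatives of the loss with respect to the previous round's prediction:

    g_i = \partial_{\hat{y}_i^{(t-1)}}   l(y_i, \hat{y}_i^{(t-1)})
    h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})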

I looked at _train_internal and several other functions in training.py and sklearn.py, but I couldn't find where these values are computed. If I understood how they are calculated and obtained efficiently, I could move on to the further algorithms used in xgboost, such as the weighted quantile sketch.

Thanks.


Solution

  • To keep track of the gradient updates in each iteration, you would need to expose the training loop in Python (rather than have it execute internally in the C++ implementation) and provide custom gradient and hessian implementations. For many standard loss functions, e.g., squared loss and logistic loss, the gradient and hessian are simple and easy to find in standard references. Here's an example showing how to expose the training loop for logistic regression.

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import confusion_matrix
    
    
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    def logregobj(preds, dtrain):
        """Gradient and hessian of the logistic (negative log-likelihood) loss."""
        labels = dtrain.get_label()
        preds = sigmoid(preds)           # raw margins -> probabilities
        grad = preds - labels            # first-order gradient g_i
        hess = preds * (1.0 - preds)     # second-order gradient (hessian) h_i
        return grad, hess
    
    
    # Build a toy dataset.
    X, Y = make_classification(n_samples=1000, n_features=5, n_redundant=0, n_informative=3,
                               random_state=1, n_clusters_per_class=1)
    
    # Instantiate a Booster object to do the heavy lifting
    dtrain = xgb.DMatrix(X, label=Y)
    params = {'max_depth': 2, 'eta': 1, 'silent': 1}
    num_round = 10
    model = xgb.Booster(params, [dtrain])
    
    # Run 10 boosting iterations
    # g and h can be monitored for gradient statistics
    for _ in range(num_round):
        pred = model.predict(dtrain)
        g, h = logregobj(pred, dtrain)
        model.boost(dtrain, g, h)
    
    # Evaluate predictions
    yhat = model.predict(dtrain)
    yhat = sigmoid(yhat)                 # convert raw margins to probabilities
    yhat_labels = np.round(yhat)
    confusion_matrix(Y, yhat_labels)
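
    If you want to actually inspect the per-sample values g1, h1, g2, h2, ... from every round, a minimal sketch (reusing params, dtrain and logregobj from above, and assuming an xgboost version with the same Booster.boost signature used above) is to store g and h just before each boost call:

    # Sketch: train a fresh Booster and record the per-sample gradient statistics
    # (g, h) computed in every round so they can be inspected afterwards.
    model = xgb.Booster(params, [dtrain])
    gradient_stats = []
    for it in range(num_round):
        pred = model.predict(dtrain)
        g, h = logregobj(pred, dtrain)
        gradient_stats.append((g.copy(), h.copy()))
        model.boost(dtrain, g, h)
        # e.g. the statistics of the first three samples in this round
        print(f"round {it}: g[:3]={g[:3]}, h[:3]={h[:3]}")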