Search code examples
pythonmachine-learningxgboostquantilequantization

What is the mathematical definition of the quantile transformation in xgboost.QuantileDMatrix?


The XGBoost package provides the function xgboost.QuantileDMatrix which takes a numpy.ndarray or pandas.DataFrame as input, applies quantile transformation and stores the data in a sparse representation to improve performance. To the best of my knowledge, if the parameter max_bin is set to be equal or larger to the number of samples in the input data (max_bin>=number_of_samples) then the quantile transformation has no effect since each data point is represented by the median of itself. However, if you do that and inspect the data afterwards with QuantileDMatrix.get_data().data you will find that the lowest value in the data is always replaced by a completely different value. If you have p features, then it will replace one value for each feature.

So how QuantileDMatrix really works? How this quantisation is defined mathematically?

How to reproduce:

import xgboost as xgb
import pandas as pd
import numpy as np

# define data with numpy
feature1 = np.array([1,2,3,4])

# put it into pandas
a = pd.DataFrame({'feature1': feature1})

quantized_a = xgb.QuantileDMatrix(a, max_bin = 4)

# to show that the behaviour is consistent both with pandas and numpy
quantized_feature1 = xgb.QuantileDMatrix(feature1.reshape(-1, 1), max_bin = 4)

print(quantized_a.get_data().data)
print(quantized_feature1.get_data().data)
# output: [-1.e-05, 2.e+00, 3.e+00, 4.e+00 ]

# different data yields similar problem
feature2 = np.array([10399., 34552., -48585., 70.])
quantized_feature2 = xgb.QuantileDMatrix(feature2.reshape(-1, 1), max_bin = 4)
print(quantized_feature2.get_data().data)

np.testing.assert_almost_equal(feature2, quantized_feature2.get_data().data)
# Arrays are not almost equal to 7 decimals

# Mismatched elements: 1 / 4 (25%)
# Max absolute difference: 48585.
# Max relative difference: 0.5
# x: array([ 10399.,  34552., -48585.,     70.])
# y: array([ 1.0399e+04,  3.4552e+04, -9.7170e+04,  7.0000e+01], dtype=float32)
# in this case -48686 is the value affected, the lowest. 
# If you make it positive, then the value affected 
# is 70 which becomes the lowest one

Here are the requirements:

xgboost>=1.7.6
numpy>=1.23.5
pandas>=1.5.7

Solution

  • Each data point is actually replaced with the lower bound of each quantile bin. For the smallest bin, the lower bound is -inf. But instead of -inf, the developers use min(2x, 0)-1.e-05. However, the developers acknowledged that min(2x, 0)-1.e-05 is not really a good surrogate for -inf and -inf should be used directly [1].

    Open issues on Github related to this function:

    [1] https://github.com/dmlc/xgboost/issues/2914

    [2] https://github.com/dmlc/xgboost/issues/9680