# What is the mathematical definition of the quantile transformation in xgboost.QuantileDMatrix?

The XGBoost package provides the class `xgboost.QuantileDMatrix`, which takes a `numpy.ndarray` or `pandas.DataFrame` as input, applies a quantile transformation and stores the data in a sparse representation to improve performance. To the best of my knowledge, if the parameter `max_bin` is greater than or equal to the number of samples in the input data (`max_bin >= number_of_samples`), then the quantile transformation should have no effect, since each data point gets its own bin and is represented by the median of itself. However, if you do that and inspect the data afterwards with `QuantileDMatrix.get_data().data`, you will find that the lowest value in the data is always replaced by a completely different value. If you have `p` features, then one value is replaced for each feature.

So how does `QuantileDMatrix` really work? How is this quantisation defined mathematically?

How to reproduce:

```
import xgboost as xgb
import pandas as pd
import numpy as np
# define data with numpy
feature1 = np.array([1,2,3,4])
# put it into pandas
a = pd.DataFrame({'feature1': feature1})
quantized_a = xgb.QuantileDMatrix(a, max_bin = 4)
# to show that the behaviour is consistent both with pandas and numpy
quantized_feature1 = xgb.QuantileDMatrix(feature1.reshape(-1, 1), max_bin = 4)
print(quantized_a.get_data().data)
print(quantized_feature1.get_data().data)
# output: [-1.e-05, 2.e+00, 3.e+00, 4.e+00 ]
# different data yields similar problem
feature2 = np.array([10399., 34552., -48585., 70.])
quantized_feature2 = xgb.QuantileDMatrix(feature2.reshape(-1, 1), max_bin = 4)
print(quantized_feature2.get_data().data)
np.testing.assert_almost_equal(feature2, quantized_feature2.get_data().data)
# Arrays are not almost equal to 7 decimals
# Mismatched elements: 1 / 4 (25%)
# Max absolute difference: 48585.
# Max relative difference: 0.5
# x: array([ 10399., 34552., -48585., 70.])
# y: array([ 1.0399e+04, 3.4552e+04, -9.7170e+04, 7.0000e+01], dtype=float32)
# in this case -48585 is the value affected, the lowest.
# If you make it positive, then the value affected
# is 70, which becomes the lowest one
```

Here are the requirements:

```
xgboost>=1.7.6
numpy>=1.23.5
pandas>=1.5.7
```

Solution

Each data point is actually replaced with the lower bound of its quantile bin. For the smallest bin, the lower bound is `-inf`. But instead of `-inf`, the developers use `min(2x, 0) - 1.e-05`, where `x` is the smallest value of the feature. However, the developers acknowledged that `min(2x, 0) - 1.e-05` is not really a good surrogate for `-inf` and that `-inf` should be used directly [1].
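As a rough check, the surrogate formula can be applied by hand to the two arrays from the question. The sketch below uses only NumPy and does not call XGBoost. The helper names `surrogate_lower_bound` and `emulate_full_resolution` are made up for illustration, and the code assumes the full-resolution case from the question (`max_bin` at least the number of samples, one feature), where only the minimum changes.

```python
import numpy as np


def surrogate_lower_bound(feature_min):
    """Stand-in for -inf as described in the answer: min(2x, 0) - 1e-5.

    Hypothetical helper; x is the smallest value of the feature.
    """
    return np.float32(min(2.0 * feature_min, 0.0) - 1e-5)


def emulate_full_resolution(values):
    """Emulate the observed output for a single feature.

    Assumption: every sample has its own bin, so every value is kept
    as-is except the minimum, whose bin has the surrogate lower bound.
    """
    values = np.asarray(values, dtype=np.float32)  # the output in the question is float32
    out = values.copy()
    out[np.argmin(values)] = surrogate_lower_bound(float(values.min()))
    return out


print(emulate_full_resolution([1, 2, 3, 4]))
print(emulate_full_resolution([10399., 34552., -48585., 70.]))
```

Under these assumptions the minimum `1` becomes `-1e-05` and the minimum `-48585` becomes `-97170` (the `1e-05` offset is lost in float32 at that magnitude), which matches the values reported in the question.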

Open issues on GitHub related to this function:
