
What is a QuantileDMatrix in XGBoost

QuantileDMatrix is a specialized data structure in XGBoost that stores a quantized (binned) representation of a dataset, built directly from the input data. It is intended for models trained with the 'hist' tree method.

This example shows how to create and use QuantileDMatrix, comparing its performance and use cases with the standard DMatrix.

When you work with large datasets or distributed training, QuantileDMatrix can lower peak memory usage: it quantizes the input as it is read, instead of first holding a full copy of the data that is quantized later when training starts.
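To make "quantized" concrete, here is a minimal NumPy sketch of quantile binning for a single feature. It is an illustration only: XGBoost uses its own quantile sketching algorithm and storage layout, and the bin count of 256 and the variable names below are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.random(100_000)  # one float64 feature column

n_bins = 256  # illustrative choice, small enough for a uint8 index

# Interior cut points placed at evenly spaced quantiles of the feature
cuts = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Replace each raw value with the index of the bin it falls into
bin_index = np.searchsorted(cuts, feature, side="right").astype(np.uint8)

print(f"Raw feature:    {feature.nbytes} bytes")
print(f"Binned feature: {bin_index.nbytes} bytes")
```

Each float64 value takes 8 bytes while each bin index here takes 1 byte, which is the basic reason a binned representation can be much smaller than the raw data.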

Here’s how you can create and use QuantileDMatrix:

import numpy as np
from xgboost import DMatrix, QuantileDMatrix, train
import time
import sys

# Generate synthetic data
X = np.random.rand(1000000, 10)
y = np.random.randint(2, size=1000000)

# Report details of the generated data
print(X[:5, :])
print(y[:5])

# Create DMatrix from NumPy arrays
dmatrix = DMatrix(data=X, label=y)

# Create QuantileDMatrix from NumPy arrays
quantile_dmatrix = QuantileDMatrix(data=X, label=y)

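# Report the size of each Python wrapper object.
# Note: sys.getsizeof does not count the native (C++) memory that holds the data.
print(f"DMatrix wrapper size: {sys.getsizeof(dmatrix)} bytes")
print(f"QuantileDMatrix wrapper size: {sys.getsizeof(quantile_dmatrix)} bytes")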

# Define training parameters
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_depth": 3,
}

# Train with DMatrix
start_time = time.perf_counter()  # monotonic clock, better suited to timing
bst_dmatrix = train(params, dmatrix, num_boost_round=10)
print(f"Training time with DMatrix: {time.perf_counter() - start_time:.2f} seconds")

# Train with QuantileDMatrix
start_time = time.perf_counter()  # monotonic clock, better suited to timing
bst_quantile_dmatrix = train(params, quantile_dmatrix, num_boost_round=10)
print(f"Training time with QuantileDMatrix: {time.perf_counter() - start_time:.2f} seconds")

In this example:

  1. We generate a synthetic dataset using NumPy and create a DMatrix object and a QuantileDMatrix object from the same dataset.

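  2. We print the size that sys.getsizeof reports for both objects. This covers only the small Python wrapper, not the native memory XGBoost allocates for the data, so the two numbers do not reflect real memory use. To compare memory efficiency, track process-level memory (for example, resident set size) while each object is built.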

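  3. We train an XGBoost model using both DMatrix and QuantileDMatrix and compare the training times. These timings exclude object construction. QuantileDMatrix quantizes the data when it is constructed, whereas a DMatrix used with the 'hist' method is quantized when training starts, so include construction time for a fair end-to-end comparison.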

QuantileDMatrix is most useful when the dataset is large enough that memory is a constraint, including large-scale distributed training. It only applies to the 'hist' tree method, so use the standard DMatrix for other tree methods. The example above shows how to construct both objects from the same data and train with each, which is a starting point for deciding which one fits your workload.


