XGBoost’s DMatrix is an optimized data structure that can efficiently hold both dense and sparse data. By loading your data into a DMatrix, you can train your model with optimal memory efficiency and training speed.
from xgboost import DMatrix, train
import numpy as np

# Assuming X and y are NumPy arrays
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 1, 1])

# Create DMatrix from X and y
data_dmatrix = DMatrix(data=X, label=y)

# Set XGBoost parameters
# (the native API uses 'seed' for the random seed, not 'random_state')
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'seed': 42
}

# Train the model for 10 boosting rounds
model = train(params, data_dmatrix, num_boost_round=10)
Here’s what’s happening:

- We assume that our feature matrix X and target vector y are NumPy arrays.
- We create a DMatrix object called data_dmatrix from X and y. DMatrix is the internal data structure used by XGBoost for both training and making predictions. It’s designed to handle data in a way that’s optimized for XGBoost’s learning algorithms.
- We set the XGBoost parameters using a dictionary, params. Here, we specify the objective function (binary logistic for binary classification), the learning rate, and the random seed. These parameters can be tuned for your specific use case.
- We train the model by passing the params dictionary and data_dmatrix to the train function. This function is part of XGBoost’s native API and handles the actual model training process.
By using DMatrix, we ensure that our data is in the optimal format for XGBoost, which can lead to faster training times and more efficient memory usage compared to other data formats.
Remember that while DMatrix is the recommended format for XGBoost’s native API, you can still use other data formats, like NumPy arrays, Pandas DataFrames, or even datasets from scikit-learn, with XGBoost’s scikit-learn compatible API.