XGBoost’s DMatrix is an optimized data structure that can efficiently hold both dense and sparse data. By loading your data into a DMatrix, you can train your model with optimal memory efficiency and training speed.
from xgboost import DMatrix, train
import numpy as np

# Assuming X and y are NumPy arrays
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 1, 1])

# Create DMatrix from X and y
data_dmatrix = DMatrix(data=X, label=y)

# Set XGBoost parameters
# (the native API uses 'seed' for the random seed, not 'random_state')
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'seed': 42
}

# Train the model for 10 boosting rounds
model = train(params, data_dmatrix, num_boost_round=10)
Here’s what’s happening:

- We assume that our feature matrix X and target vector y are NumPy arrays.
- We create a DMatrix object called data_dmatrix from X and y. DMatrix is the internal data structure used by XGBoost for both training and making predictions. It’s designed to handle data in a way that’s optimized for XGBoost’s learning algorithms.
- We set the XGBoost parameters using a dictionary, params. Here, we specify the objective function (binary logistic for binary classification), the learning rate, and the random seed. These parameters can be tuned for your specific use case.
- We train the model by passing the params dictionary and data_dmatrix to the train function. This function is part of XGBoost’s native API and handles the actual model training process.
By using DMatrix, we ensure that our data is in the optimal format for XGBoost, which can lead to faster training times and more efficient memory usage compared to other data formats.
Remember that while DMatrix is the recommended format for XGBoost’s native API, you can still use other data formats, like NumPy arrays, Pandas DataFrames, or even datasets from scikit-learn, with XGBoost’s scikit-learn compatible API.