Train an XGBoost Model on a Dataset Stored in Lists

When your data is stored in Python lists rather than NumPy arrays or pandas DataFrames, you’ll need to convert it before training an XGBoost model.

XGBoost’s DMatrix class provides an efficient way to convert list data into the format required by the train() function.

from xgboost import DMatrix, train

# Assuming X and y are lists
X = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
y = [0, 1, 1]

# Create DMatrix from X and y
data_dmatrix = DMatrix(data=X, label=y)

# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'random_state': 42
}

# Train the model
model = train(params, data_dmatrix)

Here’s what’s happening:

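We assume that the feature matrix X and target vector y are stored as Python lists: X is a list of rows, each holding the same number of feature values, and y holds one label per row. Before proceeding, check that len(X) equals len(y) and that every row of X has the same length.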
We use XGBoost’s DMatrix class to convert our list data into a format optimized for XGBoost. DMatrix is an XGBoost-specific data structure that is designed for both memory efficiency and training speed.
We set the XGBoost parameters using a dictionary params. Here, we specify the objective function (binary logistic for binary classification), learning rate, and random seed. These parameters can be tuned for your specific use case.
We train the model by passing the params dictionary and data_dmatrix to the train function, which is part of XGBoost’s native API and returns a trained Booster. Because no num_boost_round is given, train runs its default of 10 boosting rounds.
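The compatibility check mentioned in the first step can be sketched in plain Python. The helper name check_list_data is hypothetical (it is not part of XGBoost), and the rules it enforces (non-empty X, equal-length rows, one label per row) are assumptions about what "compatible dimensions" should mean for list input.

```python
def check_list_data(X, y):
    """Return (n_rows, n_cols) if X and y are consistent, else raise ValueError.

    Hypothetical helper for illustration; not part of the XGBoost API.
    """
    if not X:
        raise ValueError("X is empty")
    n_cols = len(X[0])
    # Every row must have the same number of feature values
    for i, row in enumerate(X):
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} values, expected {n_cols}")
    # There must be exactly one label per row
    if len(y) != len(X):
        raise ValueError(f"X has {len(X)} rows but y has {len(y)} labels")
    return len(X), n_cols


X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 1]
n_rows, n_cols = check_list_data(X, y)
print(n_rows, n_cols)  # prints: 3 3
```

Running a check like this before constructing the DMatrix tends to surface ragged rows or missing labels with a clearer message than an error raised during conversion.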

With DMatrix, you can train XGBoost models directly from data held in Python lists. This is especially handy when your data isn’t already in a NumPy or pandas format.
