When working with very large datasets that exceed available RAM, you can still train XGBoost models effectively by leveraging disk space. XGBoost's DMatrix allows data to be loaded from external memory, enabling training on datasets that are too large to fit into memory.

Here's a quick example of how you can use DMatrix to train an XGBoost model while utilizing external memory:
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Save the dataset to a LibSVM file
with open('train.libsvm', 'w') as f:
    for i in range(X_train.shape[0]):
        label = y_train[i]
        features = X_train[i, :]
        # Use zero-based feature indices so the column count matches the NumPy test matrix used later
        line = str(label) + " " + " ".join(f"{j}:{features[j]}" for j in range(len(features)))
        f.write(line + "\n")
# Load data into DMatrix with external memory, specifying the format as LibSVM
data_path = "train.libsvm?format=libsvm#dtrain.cache"
dtrain = xgb.DMatrix(data_path)
# Define training parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.3,
    'seed': 42
}
# Train the model
bst = xgb.train(params, dtrain, num_boost_round=10)
# Predictions can be made on test data
dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)
# Printing the first 5 predictions
print(y_pred[:5])
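Because the objective is binary:logistic, predict() returns probabilities rather than class labels. As a minimal follow-up sketch (assuming the snippet above has already been run), you could threshold the probabilities to obtain hard 0/1 predictions; the 0.5 cutoff here is just an assumed default:
# Convert predicted probabilities to 0/1 class labels (0.5 threshold is an assumption)
y_pred_labels = (y_pred > 0.5).astype(int)
print(y_pred_labels[:5])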
By initializing the DMatrix with a reference to an external data file ('train.libsvm' in this case), XGBoost can load data in batches as needed during training, allowing you to work with datasets larger than your machine's memory. In the path "train.libsvm?format=libsvm#dtrain.cache", the ?format=libsvm query tells XGBoost that the data is in LibSVM format, while the #dtrain.cache suffix names a cache file where XGBoost stores its internal data structures on disk; supplying this cache suffix is what enables the external-memory behavior and speeds up subsequent uses of this DMatrix.
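The same path convention applies to any other file you want to load this way. For example, if you also had a validation set saved as valid.libsvm (a hypothetical file name used purely for illustration), you could load it with its own cache file:
# Hypothetical validation file, loaded with the same URI convention and its own on-disk cache
dvalid = xgb.DMatrix("valid.libsvm?format=libsvm#dvalid.cache")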
With the DMatrix created, you can proceed with setting up the XGBoost parameters and training the model as usual.
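For instance, one possible way to monitor training and then quantify performance on the held-out set is to pass an evals watchlist to xgb.train() and score the predictions with scikit-learn. This is a sketch that assumes the earlier snippet has been run, so dtrain, dtest, params, and y_test already exist:
from sklearn.metrics import roc_auc_score

# Report the training AUC after each boosting round via the evals watchlist
bst = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, 'train')])

# Score the held-out test set with scikit-learn's AUC implementation
y_pred = bst.predict(dtest)
print("Test AUC:", roc_auc_score(y_test, y_pred))
When evals is supplied, XGBoost prints the chosen eval_metric (AUC here) after each round, which is a convenient sanity check that the external-memory DMatrix is feeding data correctly.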