When working with very large datasets that exceed available RAM, you can still train XGBoost models effectively by leveraging disk space. XGBoost's DMatrix allows data to be loaded from external memory, enabling training on datasets that are too large to fit into memory.

Here's a quick example of how you can use DMatrix to train an XGBoost model while utilizing external memory:
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Save the dataset to a LibSVM file
with open('train.libsvm', 'w') as f:
    for i in range(X_train.shape[0]):
        label = y_train[i]
        features = X_train[i, :]
        # Use zero-based feature indices so the column count matches the NumPy test matrix used later
        line = str(label) + " " + " ".join(f"{j}:{features[j]}" for j in range(len(features)))
        f.write(line + "\n")
# Load data into DMatrix with external memory, specifying the format as LibSVM
data_path = "train.libsvm?format=libsvm#dtrain.cache"
dtrain = xgb.DMatrix(data_path)
# Define training parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.3,
    'seed': 42
}
# Train the model
bst = xgb.train(params, dtrain, num_boost_round=10)
# Predictions can be made on test data
dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)
# Printing the first 5 predictions
print(y_pred[:5])
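Because the objective is binary:logistic, predict() returns probabilities rather than class labels. As a minimal follow-up sketch (assuming the snippet above has already been run), you could threshold the probabilities to obtain hard 0/1 predictions; the 0.5 cutoff here is just an assumed default:
# Convert predicted probabilities to 0/1 class labels (0.5 threshold is an assumption)
y_pred_labels = (y_pred > 0.5).astype(int)
print(y_pred_labels[:5])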
By initializing the DMatrix with a reference to an external data file ('train.libsvm' in this case), XGBoost can load data in batches as needed during training, allowing you to work with datasets larger than your machine's memory. In the path "train.libsvm?format=libsvm#dtrain.cache", the ?format=libsvm query tells XGBoost that the data is in LibSVM format, while the #dtrain.cache suffix names a cache file where XGBoost stores its internal data structures on disk; supplying this cache suffix is what enables the external-memory behavior and speeds up subsequent uses of this DMatrix.
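The same path convention applies to any other file you want to load this way. For example, if you also had a validation set saved as valid.libsvm (a hypothetical file name used purely for illustration), you could load it with its own cache file:
# Hypothetical validation file, loaded with the same URI convention and its own on-disk cache
dvalid = xgb.DMatrix("valid.libsvm?format=libsvm#dvalid.cache")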
With the DMatrix created, you can proceed with setting up the XGBoost parameters and training the model as usual.
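For instance, one possible way to monitor training and then quantify performance on the held-out set is to pass an evals watchlist to xgb.train() and score the predictions with scikit-learn. This is a sketch that assumes the earlier snippet has been run, so dtrain, dtest, params, and y_test already exist:
from sklearn.metrics import roc_auc_score

# Report the training AUC after each boosting round via the evals watchlist
bst = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, 'train')])

# Score the held-out test set with scikit-learn's AUC implementation
y_pred = bst.predict(dtest)
print("Test AUC:", roc_auc_score(y_test, y_pred))
When evals is supplied, XGBoost prints the chosen eval_metric (AUC here) after each round, which is a convenient sanity check that the external-memory DMatrix is feeding data correctly.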