In real-world scenarios, data often arrives in batches over time. Retraining an XGBoost model from scratch every time new data becomes available can be computationally expensive and time-consuming.
This example demonstrates how to leverage XGBoost’s incremental learning capability to efficiently update an existing model with new data, saving computational resources and time compared to retraining from scratch.
Note that the XGBoost algorithm assumes the entire training dataset is available from the start so it can make optimal split decisions. Updating a model with new or different data at a later time violates this assumption and can lead to unexpected behavior.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb
# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_classes=2, random_state=42)
# Split data into initial training set, additional training set, and test set
X_train_init, X_train_new, y_train_init, y_train_new = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_new, X_test, y_train_new, y_test = train_test_split(X_train_new, y_train_new, test_size=0.5, random_state=42)
# Create DMatrix for initial training data
dtrain_init = xgb.DMatrix(X_train_init, label=y_train_init)
# Train initial model
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1, 'seed': 42}
num_rounds = 100
model = xgb.train(params, dtrain_init, num_rounds)
# Evaluate the initial model
dtest = xgb.DMatrix(X_test, label=y_test)
predictions = model.predict(dtest)
accuracy = np.mean((predictions > 0.5) == y_test)
print(f"Initial test accuracy: {accuracy:.4f}")
# Create DMatrix for additional training data
dtrain_new = xgb.DMatrix(X_train_new, label=y_train_new)
# Update the existing model with the new data by continuing training from it
num_boost_rounds = 50
model = xgb.train(params, dtrain_new, num_boost_rounds, xgb_model=model)
# Evaluate updated model
predictions = model.predict(dtest)
accuracy = np.mean((predictions > 0.5) == y_test)
print(f"Updated test accuracy: {accuracy:.4f}")
In this example:
- We generate a synthetic binary classification dataset using scikit-learn's make_classification function for demonstration purposes. In practice, you would use your actual training data.
- We split the data into an initial training set (X_train_init, y_train_init), an additional training set (X_train_new, y_train_new), and a test set (X_test, y_test).
- We create an xgb.DMatrix object (dtrain_init) for the initial training data and train an initial XGBoost model using xgb.train with the specified parameters (params) and number of rounds (num_rounds).
- We evaluate the initial model's performance on the test set by creating an xgb.DMatrix object (dtest) for the test data, making predictions with model.predict, and calculating the accuracy.
- To update the model with new data, we create another xgb.DMatrix object (dtrain_new) for the additional training data.
- We update the existing model with the new data using xgb.train, passing the existing model via the xgb_model parameter and specifying the number of additional boosting rounds (num_boost_rounds).
- Finally, we evaluate the updated model's performance on the test set to assess the impact of incorporating the new data.
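Because updating a model later departs from the assumption that all training data is available up front (see the note above), it can be worth comparing the incrementally updated model against a model retrained from scratch on the combined data. A minimal sketch, assuming the variables from the example above are still in scope:

import numpy as np
import xgboost as xgb

# Combine the initial and additional training data
X_train_all = np.vstack([X_train_init, X_train_new])
y_train_all = np.concatenate([y_train_init, y_train_new])
dtrain_all = xgb.DMatrix(X_train_all, label=y_train_all)

# Retrain from scratch with the same total number of boosting rounds (100 + 50)
model_full = xgb.train(params, dtrain_all, num_rounds + num_boost_rounds)

# Evaluate the fully retrained model on the same test set
predictions_full = model_full.predict(dtest)
accuracy_full = np.mean((predictions_full > 0.5) == y_test)
print(f"Full retrain test accuracy: {accuracy_full:.4f}")

If the two accuracies diverge substantially, a full retrain may be worth the extra compute.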
By leveraging XGBoost's incremental learning capability, you can efficiently update your models as new data becomes available, without retraining from scratch. This approach can significantly reduce computational cost and keeps the model current by incorporating the latest information.
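In practice, this update step can be repeated for each incoming batch. The sketch below is a minimal illustration using synthetic batches; the three-batch loop, batch sizes, number of rounds per update, and the 'model.json' file name are assumptions for demonstration. Note that xgb.train also accepts a saved model path for xgb_model, so the booster can be persisted and updated across separate sessions.

import xgboost as xgb
from sklearn.datasets import make_classification

params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1, 'seed': 42}

booster = None  # no model trained yet
for i in range(3):  # stand-in for batches arriving over time
    # Hypothetical new batch; substitute your own data source
    X_batch, y_batch = make_classification(n_samples=1000, n_classes=2, random_state=i)
    dbatch = xgb.DMatrix(X_batch, label=y_batch)
    # Train from scratch on the first batch, then continue from the existing booster
    booster = xgb.train(params, dbatch, num_boost_round=20, xgb_model=booster)
    # Persist after each update so training can resume later from 'model.json'
    booster.save_model('model.json')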