Update XGBoost Model With New Data Using Native API

In many real-world scenarios, data arrives in batches over time.

Retraining a model from scratch every time new data becomes available can be computationally expensive and time-consuming, especially for large datasets. XGBoost’s native API supports training continuation, a form of incremental learning in which additional boosting rounds are fitted on new training data and appended to an existing model, so you do not have to start from scratch.

This example demonstrates how to update an XGBoost model with new data using the native API, saving computational resources and time.

Generally, the XGBoost algorithm assumes the entire training dataset is available from the beginning, so that it can make optimal decisions when building each tree. Updating a model with new or different data at a later time violates the assumptions of the underlying algorithm and could lead to unexpected behavior, so always evaluate the updated model on held-out data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_classes=2, random_state=42)

# Split data: 80% initial training set, 10% additional training set, 10% test set
X_train_init, X_rest, y_train_init, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_new, X_test, y_train_new, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Create DMatrix for initial training data
dtrain_init = xgb.DMatrix(X_train_init, label=y_train_init)

# Train initial model
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1, 'seed': 42}
num_rounds = 100
model = xgb.train(params, dtrain_init, num_rounds)

# Evaluate the model
dtest = xgb.DMatrix(X_test, label=y_test)
predictions = model.predict(dtest)
accuracy = np.mean((predictions > 0.5) == y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Create DMatrix for additional training data
dtrain_new = xgb.DMatrix(X_train_new, label=y_train_new)

# Update model with new data
num_boost_rounds = 50
model = xgb.train(params, dtrain_new, num_boost_rounds, xgb_model=model)

# Evaluate updated model
predictions = model.predict(dtest)
accuracy = np.mean((predictions > 0.5) == y_test)
print(f"Updated test accuracy: {accuracy:.4f}")

Here’s how the code works:

We generate a synthetic binary classification dataset using scikit-learn’s make_classification function. In practice, you would use your actual training data.
We split the data into an initial training set (X_train_init, y_train_init), an additional training set (X_train_new, y_train_new), and a test set (X_test, y_test).
We create an xgb.DMatrix object (dtrain_init) for the initial training data.
We train an initial XGBoost model using xgb.train with specified parameters (params) and number of rounds (num_rounds). We then create an xgb.DMatrix object (dtest) for the test data and report the model's accuracy on it.
We create another xgb.DMatrix object (dtrain_new) for the additional training data.
We update the existing model with the new data using xgb.train, passing the existing model via the xgb_model parameter and specifying the number of additional boosting rounds (num_boost_rounds).
Finally, we evaluate the updated model on the same test set, reusing the dtest object created earlier: we make predictions using model.predict and calculate the accuracy.

By leveraging XGBoost’s incremental learning capability, you can update your models with new data as it becomes available, without the need to retrain from scratch. This approach is particularly useful when dealing with large datasets or scenarios where data arrives in batches over time, as it can significantly reduce computational costs. It may also improve the model’s performance by incorporating the latest information, but because the new trees are fitted only on the new data, you should confirm any improvement on a held-out test set.
