Incrementally training an XGBoost model round by round allows you to monitor its performance over time and potentially identify an optimal number of training iterations.
This example demonstrates how to train an XGBoost classifier incrementally, one round at a time, while reporting the training and testing accuracy after each round.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=2, weights=[0.6, 0.4], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost classifier that adds one tree per fit call
model = XGBClassifier(n_estimators=1, random_state=42)
model.fit(X_train, y_train)

# Train model incrementally for multiple rounds
num_rounds = 10
for i in range(num_rounds):
    # Continue training from the current booster state
    model.fit(X_train, y_train, xgb_model=model.get_booster())

    # Make predictions on train and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate and print accuracy scores
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f"Round {i+1}/{num_rounds} - Train Accuracy: {train_accuracy:.4f}, Test Accuracy: {test_accuracy:.4f}")
```
Let’s break down the key steps:
1. Generate a synthetic binary classification dataset using `make_classification` from scikit-learn.
2. Split the data into training and test sets.
3. Initialize an XGBoost classifier with `n_estimators=1`, meaning each `fit` call will perform one boosting iteration.
4. Train the model incrementally for `num_rounds` iterations:
   - For each round, call `model.fit` with the `xgb_model` parameter set to the current booster, allowing the model to continue training from its previous state (the sketch after this list shows the same continuation with the native `xgb.train` API).
   - Make predictions on the training and test data using the model at its current state.
   - Calculate the training and testing accuracy scores using `accuracy_score` from scikit-learn.
   - Print the round number and the corresponding accuracy scores.
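If you prefer XGBoost's native interface, the same continuation pattern is available there as well: `xgb.train` accepts an existing booster through its `xgb_model` argument. Here is a minimal sketch, assuming the same `X_train` and `y_train` from the example above:

```python
import xgboost as xgb

# Wrap the training data in a DMatrix for the native API
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"objective": "binary:logistic", "seed": 42}

# First round: train a booster with a single tree
booster = xgb.train(params, dtrain, num_boost_round=1)

# Later rounds: pass the existing booster via xgb_model to keep training it
for _ in range(9):
    booster = xgb.train(params, dtrain, num_boost_round=1, xgb_model=booster)
```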
The output will display the training and testing accuracy for each round, allowing you to observe how the model’s performance evolves with each additional training iteration.
Note that the hyperparameters used for the XGBoost model and the number of training rounds (`num_rounds`) can be adjusted to suit your dataset and requirements.
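For example, raising `n_estimators` makes each `fit` call append that many boosting iterations instead of one. A minimal sketch of this variation, reusing the data from above (the `max_depth` and `learning_rate` values are illustrative, not tuned):

```python
# Illustrative variation: append 5 trees per round instead of 1
model = XGBClassifier(n_estimators=5, max_depth=3, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

for i in range(num_rounds):
    # Each call now adds 5 more trees to the existing booster
    model.fit(X_train, y_train, xgb_model=model.get_booster())
```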
By incrementally training the model and evaluating its performance at each round, you can gain insights into how the model learns over time and potentially identify an optimal stopping point to prevent overfitting.
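One simple way to pick that stopping point is a patience rule layered on top of the incremental loop: stop once the held-out accuracy has not improved for several consecutive rounds. The sketch below is one possible implementation (the `patience` value is an illustrative choice, and in practice you would monitor a separate validation set rather than the test set):

```python
best_accuracy = 0.0
rounds_without_improvement = 0
patience = 3  # illustrative: stop after 3 rounds with no accuracy gain

for i in range(num_rounds):
    # Continue training by one round, then re-evaluate
    model.fit(X_train, y_train, xgb_model=model.get_booster())
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    if test_accuracy > best_accuracy:
        best_accuracy = test_accuracy
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            print(f"Stopping at round {i+1}: no improvement for {patience} rounds")
            break
```

XGBoost also ships built-in early stopping (`early_stopping_rounds` combined with an `eval_set`), which achieves the same goal within a single `fit` call.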