XGBoost Batch Training

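Training an XGBoost model in batches of boosting rounds, referred to here as batch training, lets you measure the model’s performance after each batch. Every batch is trained on the full training set; only the boosting rounds are split up, with each batch continuing from the trees built so far.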

This example demonstrates how to train an XGBoost classifier in batches while reporting the training and test accuracy scores after each batch of rounds.

Monitoring the model’s performance during batch training provides insights into how the model learns over time and can help identify an optimal number of training rounds to prevent overfitting.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=2, weights=[0.6, 0.4], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

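# Initialize XGBoost classifier and fit the first batch of rounds
# (this initial fit is not scored; reporting starts inside the loop)
num_estimators_per_batch = 5
model = XGBClassifier(n_estimators=num_estimators_per_batch, random_state=42)
model.fit(X_train, y_train)

# Train model in batches of rounds
num_batches = 20
for i in range(num_batches):
    # Continue from the current booster; each call adds
    # num_estimators_per_batch more boosting rounds
    model.fit(X_train, y_train, xgb_model=model.get_booster())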

    # Make predictions on train and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate and print accuracy scores
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f"Batch {i+1}/{num_batches} - Train Accuracy: {train_accuracy:.4f}, Test Accuracy: {test_accuracy:.4f}")

Let’s break down the key steps:

1. Generate a synthetic binary classification dataset using make_classification from scikit-learn.
2. Split the data into training and test sets.
3. Initialize an XGBoost classifier with n_estimators=num_estimators_per_batch, which sets the number of boosting rounds per batch, and fit it once to create the initial booster.
4. Train the model in batches of rounds for num_batches iterations:
   - For each batch, call model.fit with the xgb_model parameter set to the current booster, so that training continues from the existing trees and adds num_estimators_per_batch more rounds.
   - Make predictions on the training and test data using the model at its current state.
   - Calculate the training and test accuracy scores using accuracy_score from scikit-learn.
   - Print the batch number and the corresponding accuracy scores.

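The output shows the training and test accuracy after each batch of rounds, so you can see how performance changes as rounds are added. Because the model is fitted once before the loop, the first reported batch reflects 2 × num_estimators_per_batch rounds, and each later batch adds num_estimators_per_batch more.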

The XGBoost hyperparameters, the number of estimators per batch (num_estimators_per_batch), and the number of batches (num_batches) can all be adjusted to suit your dataset and requirements.

By training the model in batches and evaluating it after each one, you can follow the model’s learning progress and identify a stopping point, for example the batch after which test accuracy stops improving, to limit overfitting.

See Also